When right beats might


The final act in a long-running Italian saga should bring tighter controls on unproven stem-cell
therapies, both at home and abroad.


What impact can scientists have on the murky world of politics?
When the Italian stem-cell researcher Elena Cattaneo was
appointed senator in her nation’s parliament in 2013, she
carried the hopes of her colleagues that she could make a difference. 
Italian courts and politicians were authorizing unproven and potentially 
dangerous stem-cell therapies. The emotions of vulnerable and desper- 
ately ill people were being played with for commercial gain. Reporters, 
including those from this journal, who tried to expose the shameful situ- 
ation were lied to, obstructed and threatened with legal action. Signora 
Cattaneo went to Rome. Together with other stem-cell researchers, she 
helped to bring an end to the whole sorry affair. Brava! 

In her research at the University of Milan, Cattaneo investigates the 
conversion of embryonic stem cells into mature nerve cells, and how 
this might one day be used to treat neurological diseases. Across the 
world, the development of such treatments, and how they should be 
tested, introduced and regulated, is pitting careful, evidence-based 
approaches against medical opportunists. Until recently, Italy was an 
example of how to get it wrong. Now it has an opportunity to show how 
it — and the rest of the world — can get it right. 

For seven years, the Stamina Foundation sold unproven stem-cell 
therapies to Italian people as a panacea for any number of conditions. A 
report into the affair published by the Italian Senate last week analyses 
what went wrong. It details the complicated history of the affair, and 
identifies the sorry list of characters — including many politicians — 
who must share the blame. 

The report makes ten sensible proposals to fix the system. Politicians 
must now fine-tune and enact them. Stamina’s is not the only unproven 
therapy to be indulged by the Italian state in recent times, but it needs 
to be the last. Scientists and policy-makers in other countries would be 
wise to take the lessons on board, too, because such lessons show that 
being right is not enough to stop harm from being done. 

Most strikingly, the Senate investigation found no systematic failure 
in the state’s technical agencies mandated to protect the public. The 
message from scientific experts in these agencies was loud and clear: 
that the Stamina claims had no merit and the technique carried consid- 
erable risks. But this expertise was ignored by other pillars of the state: 
the legislature and judiciary. 

In 2012, the Italian Medicines Agency had declared the Stamina 
treatment unsafe, and ordered the closure of the company’s labora- 
tory in a Brescia hospital. But, egged on by Stamina’s relentless cam- 
paigning, more than 450 people took to local courts and demanded the 
therapy anyway, on compassionate grounds. About half of them got it. 
Politicians, fearing an electoral backlash if they seemed to be ignoring 
patients, followed — even though they knew that a police investigation 
was under way. 

The politicians went further. Despite formal advice from their own sci- 
entific agencies, they approved ministerial decrees to promote the Stam- 
ina treatment, including one that set up a state-sponsored clinical trial. 


Scientists around the world looked on incredulously as the treatments 
continued until August 2014, when a Turin court finally ordered the 
confiscation of equipment and cells from the Stamina laboratory. 

A small group of scientists from across Italy, including Cattaneo, 
fought relentlessly and at great personal cost against the politically 
powerful Stamina supporters. Their story is told in a 2014 Nature Com- 
ment (E. Cattaneo and G. Corbellini Nature 510, 333-335; 2014). When 
Cattaneo was appointed to the Senate by former Italian president Gior- 
gio Napolitano, this fight acquired another dimension. Her first action 
was to push for the Senate investigation. The 15-strong commission 
began work at the end of January last year, sifting through documents 
and holding 25 hearings to assess evidence that anti-Stamina cam- 
paigners would have been unable to find. 

One of the report’s proposals is that any future court that is asked to 
recommend an unauthorized treatment on compassionate grounds 
should have a representative of the health ministry and the state prosecutor
there, to make sure that all arguments are heard. Another is to introduce
‘Daubert standards’ into Italian courts, to regulate the quality of scientific
expertise. These standards, which are used in US courts, require judges to
ensure that expert scientific testimony is based on knowledge that is the
product of sound scientific methodology.

The report also proposes changing the decrees and regulations that relate to
compassionate use of unauthorized therapies, to
close loopholes that could lead to abuse. It suggests new rules to ensure 
that ethics committees are truly independent, and recommends guide- 
lines for media reporting similar to those adopted by the BBC last year. 

Italian scientists are often disheartened by the lack of respect for 
science in their country. They are still reeling from the manslaughter 
verdict meted out by a court to seismologists who had advised the gov- 
ernment before the major 2009 earthquake in L'Aquila; the convictions 
were overturned on appeal last year. Moreover, the Stamina case has a 
close parallel in the notorious case of Luigi Di Bella, a physician who 
claimed in the 1990s that a mixture of molecules such as somatostatin, 
together with vitamins, could cure cancer. 

The Stamina case has been a disgrace to Italy, but it shows the influ- 
ence that individual scientists can have in fighting — even against 
seemingly impossible odds — anti-science forces. And as if to underline 
the point that science can prevail in the most hostile of environments, 
a day after publication of the Senate report, the European Commis- 
sion formally authorized the Western world’s first-ever approval for a 
stem-cell therapy: a treatment for a rare type of blindness that has been 
developed entirely by Italian scientists, working exclusively in Italy. 

It is not just in the political world that researchers can help others to 
see more clearly.


THIS WEEK | EDITORIALS
No strings 


Details of a climate-change sceptic’s links to the 
energy industry make worrying reading. 


Earlier this week, documents were passed to the news media,
including Nature, that link the global-warming sceptic Willie
Soon, a researcher at the Harvard-Smithsonian Center for Astrophysics
(CfA) in Cambridge, Massachusetts, to funders in the energy
industry and conservative circles.

The files include research contracts and year-end reports, and
provide new details on the kind of “deliverables” that Soon was providing
for his funders. The group that released them, the Climate Investigations
Center in Alexandria, Virginia, raised legitimate questions
about whether Soon had properly disclosed this funding to journals
that published his work. The CfA has responded by launching an
investigation.

Willie Soon has been a poster child for the small community of
climate-change sceptics for more than a decade. Environmentalists
and other watchdogs have examined and exposed his industry funding
countless times. Unknown until now were the explicit details of
the strings that come with such funding. In some cases, these strings 
included requirements that Soon show copies of proposed publications 
to Southern Company, a major electric utility that has given him nearly 
US$410,000 since 2006, for input before publication. The company 
did not have the right to require changes, but another provision pre- 
vented Soon and the CfA from revealing its involvement without prior 
notification. This is troubling indeed. 

Many scientists receive funding from industry as well as from foun- 
dations. Private money for science can often have an agenda, and this 
is why transparency matters so much. (Global-warming sceptics say 
that government funding has the same taint, but this comes with an 
assumption of disclosure.) 

CfA director Charles Alcock said that agreeing to provisions to 
limit disclosure was a mistake, and one that the centre will not repeat. 
Although it has no explicit rules on disclosure, the centre does expect 
scientists to follow the publishing rules that journals set out. 

One thing does not add up. The CfA, after all, is launching an inves- 
tigation into one of its own staff members on the basis of the evidence 
of its own documents, but only after it was forced to hand them to an 
environmental group under a Freedom of Information Act request. 
Whether or not Soon fully disclosed the source of his funding to all 
of the journals remains unclear, but the basic facts were always there. 

Alcock says that his job is to protect academic freedom at all costs. 
Fair enough. But freedom comes with responsibilities.


A sore thing 


The use of technologies that objectively 
measure pain must be carefully monitored. 


Injuries and illness evoke sympathy, so why do we find it so hard to
appreciate another's pain? Why, as William Shakespeare observed,
when we encounter: “A wretched soul, bruised with adversity” do
“We bid be quiet when we hear it cry”? One answer is that there is no
objective way to measure pain. This is especially true for the enduring
nature of chronic pain, when the original physical injury — if there
was one — is long gone.

For the millions of people worldwide who truly suffer from some
type of pain, such scepticism means that they go untreated. The problem
is particularly acute for women who, for reasons that are poorly
understood, are far more likely to experience chronic pain than men.
Yet the stereotype of the ‘hysterical’ woman persists. Women are substantially
less likely than men to receive prescription opiates; instead
they receive sedatives or antidepressants.

The signal of pain sits in the brain, and researchers are inching
towards ‘pain-o-meter’ brain scans that could replace or complement
a person's subjective self-reporting of their suffering. An objective
measure of the brain activity that accompanies chronic pain could
go a long way to changing the public perception that people without
obvious physical injuries are imagining or faking their condition.

As we explore in a News Feature on page 474, an increasing number
of lawyers want to see the technique introduced as evidence in
court, to help injured clients to prove that they are not malingering.
Start-up companies are charging ahead to offer commercial scans as
documentation.

This development makes many scientists and ethicists nervous. By
scientific standards, many of the methods have not yet been tested on
enough people to prove that they are accurate and impossible to cheat.
In response, lawyers argue — fairly enough — that even if the tests are
not statistically indisputable, there is no harm in providing one more
piece of evidence to back up their clients’ claims.

Much more worrying is the possibility that the technology will be
misused, forcing plaintiffs — or even patients — to prove that they are
in pain to receive compensation, insurance coverage or pain medi- 
cation. Although it is unlikely that physicians will begin routinely 
ordering expensive scans of their patients’ brains before prescribing 
opiates, it is easy to imagine insurance companies wanting proof of 
chronic pain before they shell out for years of treatment. 

Pain-o-meters have a research use as well. Several major pharma- 
ceutical companies are already beginning to use such neuroimaging 
techniques to test new painkillers — a notoriously difficult task 
because of the myriad threads that create the mental tangle of the pain 
experience. Fear, depression, attention and the power of suggestion all 
colour a person's report of sensation. The end result is that promising 
analgesic drugs are thrown out because patients might not think they 
are helping, even if they are actually fixing the biological cause of the 
pain. An objective measuring stick could allow researchers to push this 
jumble aside, treat the pain, and then treat the factor that is making 
the patient think that they are still in pain. At the same time, a better 
understanding of what pain looks like could reveal new drug targets. 

Discussion of chronic pain mirrors other debates in medicine, most 
notably the distinction frequently drawn between physical and men- 
tal suffering. Legal systems and society as a whole persist with the 
idea that mental anguish is somehow different and less important. 
US courts, for instance, allow compensation to be paid for physical 
injury, but rarely emotional injury. 

Laws and attitudes have simply not evolved with the scientific 
understanding of the brain. The idea that illnesses such as depression 
or post-traumatic stress disorder are the result of physically disordered 
brain circuits is catching on. Neuroscientists are comfortable blurring 
the line between the physical and the mental as they search for the 
biological roots of disease. 

Pain does more than blur the lines between the mental and the 
physical; it unites them. Because each individual’s experience is the 
product of so many components, brain scans may not pick up what 
feels very real to the sufferer. Scientists know surprisingly little about 
how exactly chronic pain intertwines with emotional and mental 
processes, which seem to be responsible for 
perpetuating the feeling long after the injured 
nerves have healed. 

Measuring pain might not make it go away, 
but it could still offer some relief.
WORLD VIEW

Focus on political Islamic groups to boost science

For science to realize its potential in the Muslim world, attitudes need to change
at a societal level, not just an individual one, says Dyna Rochmyaningsih.

NATURE.COM: Discuss this article online at go.nature.com/bd9ngj

Recent terrorist attacks in Europe and the continued activity of the
jihadist group ISIS in the Middle East have thrown the spotlight
firmly back on radical Islam. Some studies blame the Muslim
world’s poor and unstable economies for the spread of this fundamentalism.
Presumably then, improving the economy could help Muslim
societies to tackle these radical movements.

Science can play a big part in this economic development, as it has in
other places. But because some Muslims see a conflict between science
and their faith, the philosophical question of how to reconcile the two
is at the heart of many efforts to advance scientific development in the
Muslim world.

Earlier this month, the organization Muslim-Science.com gathered
a task force of prominent Muslim scholars in Istanbul to discuss the
importance of reconciliation to “the future of the
Islamic Project and its ability to embrace modernity”.

The marriage of faith and science produced
advances in mathematics, medicine, physics and
astronomy during the Middle Ages. To recreate the
enlightened attitudes present during this golden age,
the scholars argue, Muslim scientists need to build
broader societal support for science.

In my view, this focus on personal reconciliation
is a naive way to address the problem and one
that is unlikely to have much effect. Reconciliation
is philosophically and theologically important for
individual scientists, but will have little impact on
wider society. It demands critical thinking. And,
ultimately, the scientists it concerns form a tiny
part — about 0.01% — of the world’s Muslims.

Attempts at reconciliation could even make the
situation worse, and harden anti-science attitudes in
Muslims. That has happened in Indonesia, home to the world’s largest
Muslim population, with the publication of the book Adam Was Born,
which attempts to reconcile Islamic faith and evolution. By re-analysing
verses of the Koran as positive to science, the author, Agus Mustofa,
enraged traditional clerics and polarized opinions.

The problem is that, unlike Catholicism, Islam has no unifying voice
of authority to rule on koranic interpretation. Although it is legal for
Mustofa to interpret the Koran, clerics have much more influence, and
this gives them great power.

Rather than reconciliation, it is important to monitor and understand
the way in which political and ideological groups influence how young
Muslims view science.

The radical Islamists of ISIS see science as an attribute of their enemies.
They have denounced the great Medieval Muslim scientists Ibn Sina
and Ibn al-Nafis as heretics and atheists. It is clear that such rhetoric —
if influential — will hold back scientific development in Muslim countries.


Here in Indonesia, for example, groups such as the Muslim 
Brotherhood and Hizbut Tahrir have a strong presence in high schools 
and universities, and this gives them profound influence on young Mus- 
lims’ views of the world, including science. 

The influence is not all negative to science. The Muslim Brother- 
hood, although hostile to evolution, encourages talented scientists to 
develop their careers and helps to place them on postgraduate courses 
overseas, typically in Japan. Many of these people return to Indonesia 
as university lecturers. 

However, some Muslim groups think that asking a lot of questions is 
a Jewish trait, and one not to encourage. The convener of the task force, 
Usama Hasan, says that just as the Nazis labelled quantum mechanics 
as Jewish science, so fundamentalist Muslim groups talk about kafir 
science — the science of the unbeliever. 

These groups have much more potential to 
influence the future scientists, engineers and 
politicians of the Muslim world than individual 
researchers. Yet Muslim scholars have largely 
ignored them. 

The organizers of the Istanbul event, for exam- 
ple, also held a meeting on science education in 
the Muslim world. Scholars at the meeting have 
proposed recommendations that will be submit- 
ted in June to the Organisation of Islamic Coop- 
eration. None mentions the potential of political 
and ideological groups. 

The Atlas of Islamic- World Science and Innova- 
tion, a report initiated by Britain’s Royal Society 
and published last December, says that science 
and technology is being held back by the same 
issues in the Muslim world as in many developing 
nations: poor funding, low investment in people and a lack of interna- 
tional collaboration. 

This report, too, ignores the potential of political Islamic groups. As a
result, it makes the same recommendations for improving science in the
Muslim world as for, say, South America. We must look at what makes 
each region unique. 

There is no easy way to counter the impact of political Islamic groups 
on science, but it should be studied and accounted for. And it should 
certainly take priority over the reconciliation of science and faith. 

Reconciliation is an individual process, and something that is intan- 
gible in the realm of policy-making. By contrast, hard-line groups can 
influence whole societies. To capitalize on this influence, we might need 
to reform science education in primary schools in the Muslim world, 
and teach young people to think for themselves before they are exposed 
to political ideas.


Dyna Rochmyaningsih is a freelance science journalist in Jakarta. 
e-mail: drochmya87@gmail.com 
RESEARCH HIGHLIGHTS
Selections from the scientific literature


PHOTONICS


Water lens with adjustable focus


Researchers have developed a microscopic lens with a focal length that
can be controlled in less than a millisecond.

Controlling the focus of an optical lens is useful for microscopy and
photography, but existing reconfigurable lenses are often bulky or slow
to adjust. Romain Quidant and his colleagues at the Institute of
Photonic Sciences in Barcelona, Spain, created a controllable lens by
placing a disc of gold nanorods inside a thin chamber of water and
putting it on top of a conventional lens.

They used a laser to excite the electrons in the nanorods, heating the
water and changing its refractive index to create a lens-like effect.
The team was able to vary the focal distance of the lens by tens of
micrometres with sub-nanometre accuracy, and in only 200 microseconds.
ACS Photonics http://doi.org/2cd (2015)


BIOMATERIALS


DNA-based gel for printing organs


A gel that can be infused with live cells and nutrients makes a
promising material for printing three-dimensional tissues such as
artificial organs.

Dongsheng Liu at Tsinghua University in Beijing, Wenmiao Shu at
Heriot-Watt University in Edinburgh, UK, and their team made two
water-based inks from peptides and synthetic DNA strands that form a
stable hydrogel when mixed. The team printed layers of the gel to build
up millimetre-scale structures (pictured). They also infused their inks
with live mouse cells and showed that the cells survived the printing
process and remained functional. Unlike some previous biocompatible
scaffolds, the hydrogel is strong enough to keep its shape without
swelling or shrinking, but it can also be broken down easily by
DNA-digesting enzymes.
Angew. Chem. Int. Edn http://doi.org/f24b2n (2015)


Competing bluebirds make tougher sons


Female western bluebirds that have to compete for nesting sites produce
more early-hatching male chicks than do females with fewer competitors.
The chicks are also likely to be more aggressive. This has long-term
effects on the range and behaviour of subsequent generations.

Renée Duckworth and her colleagues at the University of Arizona in
Tucson discovered that female western bluebirds (Sialia mexicana;
pictured) that live in areas with many neighbours and few nesting sites
laid eggs containing more androgen — a hormone that boosts aggression
in the offspring — than females facing less competitive pressure. Those
first eggs also tended to produce more males, which can compete for and
colonize new territory. When the researchers increased the number of
nesting sites in study areas in western Montana, however, the females
produced eggs with less androgen, and fewer male offspring in the early
eggs.

This eventually allowed the western bluebird to boost its numbers and
displace its competitor, the mountain bluebird (S. currucoides).

Science 347, 875-877 (2015)
VOLCANOLOGY


Sulfur in magma
gets a lift


Sulfur and metals can hitch 
a ride on bubbles rising in 
molten magma. This could 
explain why some volcanoes 
spew out more sulfur than 
expected, and how metal ores 
can form in the crust nearby. 
Sulfur-rich magma 
normally sinks to the bottom 
of magma chambers. A team 
led by Jim Mungall at the 
University of Toronto in 
Canada used lab studies and 


mathematical modelling to 
show that magma droplets, 
which contain metals, can 
form on the surface of vapour 
bubbles. Droplets that do not 
reach the surface cool and 
form rocks that are rich in 
sulfur, copper and gold. 

In another study, Jon Blundy 
and his team at the University 
of Bristol, UK, used lab 
experiments to conclude that 
sulfur-rich gases interact with 
salty, copper-rich fluids inside 
a magma chamber to form 
thick deposits of copper-based 
minerals — similar to those 
that provide three-quarters of
the world’s copper.

Nature Geosci. http://dx.doi.org/10.1038/ngeo2373;
http://dx.doi.org/10.1038/ngeo2351 (2015)


Plague came to 
Europe in waves 


The bacterium that causes the 
plague, which killed millions 
of Europeans over four 
centuries from the 1350s, was 
repeatedly reintroduced from 
Asia and did not establish itself 
in European rodents as was 
thought. 

Yersinia pestis bacteria live 
in wild rodents and can infect 
humans when climate changes 
cause rodent populations 
to collapse, triggering 
plague-carrying fleas to find 
alternative hosts. To locate 
plague reservoirs in Europe, 
Nils Christian Stenseth at the 
University of Oslo and his 
colleagues analysed historical 
outbreaks along with tree- 
ring-based records of climate. 
They found no connection 
between fluctuations in 
European climate and plague 
outbreaks, but did find 
links between Asian climate 
changes and outbreaks at 
European trade harbours. 

The authors conclude that 
the plague took about 15 years 
to travel overland to Europe. 
Proc. Natl Acad. Sci. USA 
http://dx.doi.org/10.1073/ 
pnas.1412887112 (2015) 


BIOCHEMISTRY 


Sunlight damages 
DNA in the dark 


Sunlight can cause cancer- 
related DNA damage hours 
after light exposure, owing to 
a skin pigment that was largely
thought to be protective. 
Douglas Brash at Yale 
University School of Medicine 
in New Haven, Connecticut, 
and his team studied how the 
pigment melanin in mouse skin 
cells responds to ultraviolet 
(UV) light. They found that 
UVA radiation, the main type 
of UV light that comes from 
the Sun and from tanning beds, 


creates melanin by-products 
that damage DNA, generating 
DNA derivatives called 
cyclobutane pyrimidine dimers 
(CPDs) for up to three hours 
after light exposure. 

CPDs are associated with 
the skin cancer melanoma, so 
blocking their formation could 
be a way to develop sunscreens 
that can be used after exposure 
to sunlight, the team says. 
Science 347, 842-847 (2015) 


CANCER

Bacteria protect 
tumours 


Bacteria hiding out in tumours 
can shield them from attack by 
the immune system. 

The oral bacterium 
Fusobacterium nucleatum 
has been linked to premature 
birth, rheumatoid arthritis and 
colon cancer. Gilad Bachrach 
and Ofer Mandelboim at 
the Hebrew University of 
Jerusalem and their colleagues 
studied the impact of the 
bacterium on cancer cells. 
They found that F. nucleatum
sticks to tumour cells grown in 
culture and inhibits immune 
cells by activating an immune- 
cell receptor called TIGIT. 
Many immune-cell types 
found in human colon cancer 
and melanoma samples also 
expressed TIGIT, and were 
inhibited by F. nucleatum.

The results could explain 
why certain tumours, 
especially intestinal ones, seem 
to have high levels of bacteria. 
Immunity 42, 344-355 (2015) 


PLANT SCIENCE 


Nectar fends off 
bee parasites 


Floral nectar helps to control 
parasites in bumblebees. 
Plants produce molecules 
called secondary metabolites 
that are harmful to herbivores 
but in some cases can 
also protect animals from 
parasites. To see whether 
such metabolites in nectar 
similarly affect pollinators, 
Leif Richardson at Dartmouth 
College in Hanover, New 
Hampshire, and his team 
infected eastern bumblebees (Bombus impatiens) with an intestinal
parasite and gave the bees one of eight different nectar compounds.
Four of the metabolites reduced the load of parasites by 60-80%.

The compound with the strongest effect on parasites, anabasine, did not
seem to boost bumblebee survival, but the team says that these chemicals
in nectar could benefit the bee colony as a whole by reducing parasite
spread.

Proc. R. Soc. B 282, 20142471 (2015)


SOCIAL SELECTION
Popular articles on social media


Scientists cautious about outreach


Scientists think that they should actively participate in public
debates about science and technology — but many have
misgivings about doing so, according to a survey of nearly
4,000 US researchers. The results of the poll, by the Pew
Research Center, inspired a fresh online conversation about
the use of social media in public engagement. “Been saying
for years scientists need to come down from ‘ivory tower’ and
engage public,” tweeted Caleph Wilson, a cancer researcher at
the University of Pennsylvania in Philadelphia. Ajinkya Kamat,
a physics PhD student at the University of Virginia in
Charlottesville, tweeted: “We need more avenues, better incentive
structure to get scientists in all career stages involved in science
outreach.”

NATURE.COM: For more on popular papers: go.nature.com/8mfcqm


Coral growth shut
down for millennia


Coral reefs in the eastern
Pacific Ocean stopped growing
for 2,500 years, probably
because of a change in climate
four millennia ago.

Lauren Toth at the Florida
Institute of Technology
in Melbourne and her
colleagues extracted a 2.68- 
metre core from a reef in the 
Gulf of Panama (pictured), 
representing 6,750 years of 
growth. They analysed the 
chemical composition of 

133 skeletons of Pocillopora 
corals in the sample to assess 
coral health, local temperature, 
ocean currents and rainfall. 
They found that roughly 
4,100 years ago, cooler 
temperatures and greater 
rainfall — similar to today’s La
Niña weather systems — were
associated with the beginning 
of a 2,500-year pause in coral 
growth. The health of the corals 
seems to have declined at the 
start of this hiatus. 

The samples also suggest 
that temperature is a key factor 
affecting coral growth. 

Nature Clim. Change http://dx.doi. 
org/10.1038/nclimate2541 (2015) 
SEVEN DAYS


IPCC head resigns 


Rajendra Pachauri resigned as 
head of the Intergovernmental 
Panel on Climate Change 
(IPCC) on 24 February, 

amid allegations of sexual 
harassment. The accusations 
were made by a colleague at 
The Energy and Resources 
Institute (TERI), the non-profit 
organization that Pachauri 
directs in New Delhi, India. 
The IPCC said that panel vice- 
president Ismail El Gizouli 
would step in as acting chair 
for its session in Nairobi this 
week. Pachauri became chair 
of the IPCC in 2002, and 

was scheduled to complete 

his second term in office 

in October. See go.nature. 
com/1ssogm for more. 


US data chief 


The White House appointed its first-ever chief data scientist
on 18 February. DJ Patil, a former mathematician who helped to coin
the term ‘data science’ and who has worked at companies including
Skype, PayPal and eBay, will be in charge of US government policies
around open data, based at the Office of Science and Technology
Policy. He will also work on the US Precision Medicine Initiative,
announced in January, which seeks to link genomic data to health
records in order to find patient-customized treatments.


Nations asked to tackle tropical diseases


Neglected tropical diseases affect more than 1.5 billion people
worldwide, yet many diseases, such as river blindness, may be
prevented simply by taking a pill. In a report released on
19 February, the World Health Organization estimated that it will
cost US$34 billion over 16 years to meet its targets to reduce the
burden of the 17 neglected tropical diseases, which include
leishmaniasis and leprosy. It has called on affected countries to
boost their spending.


Financial conflict
The Harvard-Smithsonian Center for Astrophysics in Cambridge,
Massachusetts, has launched an investigation into solar physicist
and climate-change sceptic Willie Soon, after documents detailing
research contracts with the energy industry and a conservative
foundation were released in response to a US Freedom of Information
Act request by Greenpeace. Officials at the Climate Investigations
Center in Alexandria, Virginia, which revealed the documents on
21 February, allege that Soon failed to report these financial
relationships on numerous peer-reviewed papers. See
go.nature.com/tvgedg for more.


X-ray pioneer dies 
Ernest Sternglass, a pioneer of 
safer X-ray imaging, died on 
12 February, aged 91. Born in
Berlin in 1923, Sternglass fled 
Germany to the United States 
in 1938. While at the firm 
Westinghouse, he worked on 
electron-amplification effects 
that were later harnessed in 
the low-light television camera
that allowed viewers to watch
the live Moon landing in
1969. In the 1980s, while at
the University of Pittsburgh,
Pennsylvania, Sternglass and 
colleagues developed digital 
X-ray imaging. He also applied 
his studies on the dangers of 
X-ray exposure to the health 
effects of atomic-bomb testing, 
about which he publicly 
campaigned. 


Future of graphene 
Europe's €1-billion
(US$1.3-billion) initiative
to commercialize graphene,
which has 142 industrial
and academic partners in
23 countries, is on course and
“providing excellent value for
money”, its organizers said on
24 February. In mid-January, 
an independent assessment 
of milestones reached by
the Graphene Flagship
project, produced for the
European Commission, gave
positive scores all round, a
spokesperson for the project
said. That assessment has
not yet been published,
but a 200-page road map
setting out research areas for 
graphene and other two- 
dimensional crystals was 
published this week, covering 
11 themes from energy 
storage to biomedical devices 
(A. C. Ferrari et al. Nanoscale 
http://doi.org/2df; 2015). 


IISC PHOTOGRAPHY CLUB 


SOURCE: NPCC 


POLICY 


Dietary advice 

The US Dietary Guidelines 
Advisory Committee released 
its scientific assessment of 
nutritional guidance on
19 February. The report
calls for the eradication of
limits on dietary cholesterol,
sets an upper limit for
added sugar consumption,
and recommends a diet 
containing more vegetables 
and less meat than are 
consumed by many people — 
both to improve health and to 
reduce environmental impact. 
The scientific assessment is 
available for public comment 
until 8 April. The final 
publication, due out in the 
autumn, will be used to guide 
health recommendations as 
well as public programmes 
such as school lunches and 
food assistance. 


EVENTS 


Indian PhDs protest
Thousands of Indian PhD students are protesting (pictured)
about delays in a hike in research-fellowship wages pledged
by the Indian government last October, after protests in
July. Last week, half a dozen students went on hunger strike.
While some funding agencies have implemented the rise, others
have delayed its introduction, prompting a letter from
students to Prime Minister Narendra Modi in January. See
go.nature.com/adq5pw for more.


TREND WATCH


Global warming is likely to increase sea levels around New
York City by 56-127 centimetres by 2100, and in the worst
case by 183 cm, warns an assessment released by the New York
City Panel on Climate Change on 17 February. The study says
that the city’s sea levels have risen by around 3 cm per
decade — nearly twice the global average — since 1900. The
coastal area affected by floods could double by 2100, with
the boroughs of Queens and Brooklyn having the largest amount
of land area at risk.


New killer virus 

The US Centers for Disease 
Control and Prevention 
announced the discovery
of a new deadly virus on
19 February. Dubbed Bourbon
virus after the county in
Kansas where it was found,
it is thought to have killed a
man, aged over 50, who was 
bitten by ticks shortly before 
falling ill. Bourbon virus is 
part of a family known as 
thogotoviruses, which are 
carried by ticks and insects and 
are known to have infected 
only eight people previously. 


Anti-HIV-jab trial


Trials to test whether injectable antiretroviral drugs could
prevent HIV infection for months at a time were launched by
the US National Institutes of Health on 19 February. The
trials, to be carried out in Africa, South America and the
United States, will compare the effectiveness of the
injections against placebo and oral forms of the medications.
Currently, one pill (Truvada) is approved for HIV prevention
in the United States, but it must be taken daily; injectable
drugs might offer long-lasting protection.


Fast Ebola test


The first rapid diagnostic test for Ebola — which gives a
result in 15 minutes — was approved by the World Health
Organization in Geneva, Switzerland, on 20 February. The
inexpensive test, developed by Corgenix of Broomfield,
Colorado, and Robert Garry, a virologist at Tulane University
in New Orleans, Louisiana, detects viral protein and does not
require electricity or refrigeration, making it suitable for
use in remote settings. Currently, blood samples from
suspected cases in West Africa must be transported to labs
for genetic testing, which incurs delays. Faster diagnosis is
considered key to combating the epidemic. See
go.nature.com/e5bml6 for more.


FUTURE FLOOD ZONES


Sea-level rise in the coming century is projected to put the
people, economy and infrastructure of New York City at great
risk.
*Using high-estimate 90th percentile projections of sea-level rise


FUNDING 
Costly drugs 


The National Health Service 
(NHS) in England is doing 
more harm than good by 
paying for expensive new 
drugs, says a team of UK health 
economists. The researchers 
suggest that the threshold used 
by the National Institute for 
Health and Care Excellence to 
assess the cost effectiveness of 
new drugs should be drastically 
lowered (K. Claxton et al. 
Health Technol. Assess. 19, 14; 
2015). The NHS will currently 
fund treatments that cost up to 
£30,000 (US$46,000) for every
extra year of good-quality life 
they provide. If this figure was 
dropped to £13,000, benefits 
would spread to more people, 
say the authors. 
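
The arithmetic behind a lower threshold can be sketched in a few lines. This is a simplified illustration only, assuming that a fixed budget is spent entirely at the threshold price per extra year of good-quality life; the budget figure and the function name are hypothetical and do not come from the study.

```python
def quality_years_bought(budget_gbp, cost_per_quality_year_gbp):
    """Years of good-quality life a fixed budget buys at a given cost per year."""
    return budget_gbp / cost_per_quality_year_gbp


# Hypothetical budget in pounds, chosen only so that the division is exact.
BUDGET = 3_900_000

# At the current ceiling of 30,000 pounds per quality year.
at_current = quality_years_bought(BUDGET, 30_000)

# At the ceiling of 13,000 pounds per quality year discussed by the authors.
at_proposed = quality_years_bought(BUDGET, 13_000)

print(at_current, at_proposed)
```

Under these assumptions the same money buys more than twice as many quality years at the lower ceiling, which is the sense in which benefits would spread to more people.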


Plutonium shortage 
It is likely to take longer than anticipated to bolster
NASA’s dwindling stocks of plutonium-238, which is used to
power deep-space missions. The US Department of Energy (DOE)
has been ramping up production of the radioactive isotope,
aiming to provide NASA with more than 1 kilogram per year by
2021 (see Nature 515, 484-486; 2014). But limited funding
means that it will take an unspecified time longer, DOE
space-power director Alice Caponiti told a NASA advisory
panel on 20 February. With limited supplies, NASA has had
to ration the fuel.


Birth-cohort studies follow children from the moment they are born to learn what shapes their life trajectory. 


Wanted: 80,000 
British babies 


UK launches massive study to track children from birth, 
months after closure of US counterpart. 


BY HELEN PEARSON 


An ambitious study that will follow 80,000 children from
cradle to grave has launched in the United Kingdom,
two months after a similar project in the United States
ended in expensive failure. The project aims to track a
generation of twenty-first-century babies and work out
which factors in their early lives are important
in shaping their health and wealth as they grow
into adults. There are reasons to hope that the
Life Study will have a happier ending than its
US counterpart, the National Children’s Study.

Such ‘birth-cohort’ studies are prized.
Scientists have used them to extract a stream
of associations — for example, deducing that
smoking during pregnancy is linked to poor
child development, and that children born at
socio-economic disadvantage are more likely
to struggle at school.

Researchers argue that new birth cohorts are 
needed. Children born today, at least in most 
Western countries, enter a world that is increas- 
ingly warmer, more digitized, more ethnically 
diverse and more obese, with wider income 
inequality, than it was even a decade ago. New 
questions and techniques, such as sophisticated 
genetic analyses, also arise as time goes on, 
allowing different information to be gleaned. 

The National Children’s Study aimed to 
follow 100,000 children from birth to age 21, 
but was cancelled in December 2014 before it 
fully launched, 15 years and US$1.2 billion after 
its inception (see Nature http://doi.org/2dh; 
2014). Scientists had started to recruit parents 
and children, but the study struggled to find a
clear scientific direction, had trouble enrolling 
participants and racked up eye-watering costs. 

Meanwhile, scientists in the United King- 
dom have been getting their own birth-cohort 
study off the ground, although it has attracted 
much less attention than the US study. The team 
involved, led by paediatric epidemiologist Carol 
Dezateux of University College London’s Institute
of Child Health, officially launched the Life
Study this week at the House of Lords, to raise 
its profile among politicians and policy-makers. 

Studies in Norway and Denmark are each 
currently following more than 100,000 chil- 
dren, and the United Kingdom already has 
a series of smaller birth cohorts, the first of 
which started in 1946 (see Nature 471, 20-24; 
2011). But the Life Study aims to distinguish 
itself, in particular by collecting detailed infor- 
mation on pregnancy and the first year of the 
children’s lives — a period that is considered 
crucial in shaping later development. 

The scientists plan to squirrel away freezer- 
fulls of tissue samples, including urine, blood, 
faeces and pieces of placenta, as well as reams 
of data, ranging from parents’ income to 
records of their mobile-phone use and videos 
of the babies interacting with their parents. 

The idea of a major new British birth cohort 
was first aired in the mid-2000s, but it took years 
to get organized. Government funding bodies 
agreed in 2011 to pay £38.4 million (US$60 mil- 
lion) until 2019. The scientists have since done 
pilot studies, and late last autumn they started to 
recruit parents into the study proper, aiming
to enrol all participants by 2018.

26 FEBRUARY 2015 | VOL 518 | NATURE | 463

© 2015 Macmillan Publishers Limited. All rights reserved

Certain factors make researchers optimis- 
tic that the British study will succeed where 
the US one failed. One is the National Health 
Service, which provides care for almost all 
pregnant women and their children in the 
United Kingdom, and so offers a centralized 
means of recruiting, tracing and collecting 
medical information on study participants. 

In the United States, by contrast, medical 
care is provided by a patchwork of differ- 
ent providers. “I think that most researchers 
in the US recognize that our way of doing 
population-based research here is simply 
different from the way things can be done 
in the UK and in Europe, and it will almost 
always be more expensive here,” says Mark 
Klebanoff, a paediatric epidemiologist at 
Nationwide Children’s Hospital in Colum- 
bus, Ohio, who was involved in early dis- 
cussions about the US study. 

At one stage, US researchers had planned 
to knock on doors of random houses looking 
for women to enrol before they were even 
pregnant. “It became obvious that that wasn't
going to be a winning formula,” says Philip
Pizzo, a paediatrician at Stanford University 
in Palo Alto, California, who co-chaired 
the working group that concluded that the 
National Children’s Study was not feasible. 
“The very notion that someone was going 
to show up on your doorstep as a representa- 
tive from a government-funded study and 
say ‘Are you thinking of getting pregnant?’ 
was not so attractive sociologically.” 

Researchers involved in the UK study say 
that they hope to learn from the challenges 
faced by their US counterparts — they have 
a clear study design and recruitment strat- 
egy — and that they are keen to collabo- 
rate internationally. The major concern is 
whether enough interested parents will sign 
up, something that will become apparent 
only in the next few months. “It’s the known 
unknown,” says Dezateux.

US researchers mourn the demise of their 
study. “We have now lost the opportunity 
to remain at the forefront of this field, and 
to collect the crucial life-course data,” says 
Ezra Susser, an epidemiologist at Colum- 
bia University in New York City. But start- 
ing a study that lasts a lifetime comes with 
particular challenges wherever it is done, 
says Pizzo. For instance, about one-third of 
the children in the UK study are expected 
to live to 100. The scientists designing the 
study will be long dead by then, and can 
only hope that the information they col- 
lect will still be useful. “The responsibility 
of getting it right will be enormously sig- 
nificant,” says Pizzo. “If you think of what's 
happened in the last decade — in terms of 
social media, how we connect, the insights 
around basic biology — 100 years from 
now, it’s almost imponderable to think 
where knowledge is going to be.”


The Francis Crick Institute sits at the nexus of three central London railway hubs. 


URBAN SCIENCE 


Biology powerhouse 
raises railway alarm 


Central London’s Francis Crick Institute fears that proposed 
train line will disrupt delicate science experiments. 


BY DANIEL CRESSEY 


A landmark addition to London’s science scene
is on a collision course with the
expansion of the city’s transport system.
The Francis Crick Institute warns that
vibrations and electromagnetic fields gener- 
ated by Crossrail 2, a proposed railway line that 
would skirt the institute, could interfere with 
scientific work there. The teams behind both 
efforts are now seeking a solution. 

Set to employ 1,250 scientists and to have 
an annual budget of more than £100 million 
(US$154 million), the Crick, as it is known, is 
destined to become one of Europe's medical- 
science powerhouses. Construction is due to be 
completed in November. The warnings about 
Crossrail 2 first emerged from the UK Medical 
Research Council (MRC), which provided the 
lion’s share of the Crick’s construction budget 
and will move staff from its National Institute 
of Medical Research (NIMR), in another part 
of London, to the new facility. 

In public documents submitted to a UK 
parliamentary committee and discussed at the 
MRC over the past year, the council warned of 
“potentially serious consequences for opera- 
tion of sensitive scientific equipment” if Cross- 
rail 2 goes ahead, and said that “expensive 
remediation works” might be required. 
The Crick has now sounded its own warning
in a statement to Nature: “The Crossrail 2
trains stopping and starting at the proposed
station would have an electromagnetic impact
on our imaging facilities.” The imaging equipment
— which includes nuclear magnetic
resonance spectrometers, electron micro- 
scopes and super-resolution microscopes 
— will support “a significant component” of 
research at the building, says the Crick. 

When the institute was first proposed, some 
scientists at the NIMR raised concerns about 
interference from multiple railway lines that 
already surround the site (see ‘Science in the 
city’). MRC chief executive John Savill says 
that the Crick’s design “fully took into account” 
vibration, noise and electromagnetic interfer- 
ence from these; the Crick team says that it 
deliberately located its imaging facilities away 
from the major Thameslink line. But in 2013, 
Crossrail 2’s proposed route was modified to 
allow the line to connect with Euston station, 
which, says the Crick, would place the railway 
too close to the imaging facilities. 

The government will decide next month 
whether to use the proposed route in the next 
stage of planning; the Crossrail 2 team hopes to 
start construction in 2020 at the earliest. 

“The Department for Transport has
given assurances that the route selected will
not impede any research activities due to
interference from railway operations or construction
works,” says Savill. Transport for London
(TfL), the local government organization
that is driving the Crossrail 2 plans, says that it 
takes the concerns “very seriously”. The MRC is 
now working with TfL to find a solution. 

The Crick team would like Crossrail 2 to be 
diverted back to its former route, away from 
the institute. But Michèle Dix, managing director
for Crossrail 2, says that plans have already
changed because of concerns from the Crick.
Among other things, the planned tunnel has
been shifted farther underground. “You can't
just keep on moving it deeper and deeper,” she
says, adding that further concerns should be 
addressed through engineering — for example 
by making the tunnel linings thicker. 

Daniel Moylan, a TfL board member and 
a transport adviser to London mayor Boris 
Johnson, says that the mayor is a huge sup- 
porter of both Crossrail 2 and the Crick. The 
latter is a major element in Johnson’s plans to 
promote the capital as ‘MedCity’ — a global 
hub for life-sciences research. Moylan is con- 
fident that the institute’s “legitimate” concerns 
can be allayed by means of technical solutions. 

Objections from Parisian academics in the 
first half of the twentieth century are said to have 
affected the route of the Métro, but there have 
also been more-modern conflicts between sci- 
entific equipment and transport infrastructure. 


SCIENCE IN THE CITY 


The design of the Francis Crick Institute took
into account existing railway lines, but not the
latest proposed route for the Crossrail 2 line.

In 2013, the proposed route of a US light-rail
line was diverted after concerns from the Uni- 
versity of Colorado Denver about the effects on 
spectroscopy and microscopes at its medical 
campus in Aurora. In 2011, the University of 
Maryland in College Park dropped its oppo- 
sition to a new ‘Purple Line’ link to the Wash- 
ington DC Metro once the local transportation 
agency agreed to bury and shield power lines. 
Other labs have depended on engineering
solutions. The New York Structural Biology 
Center, located near a number of subway lines, 
placed its most sensitive equipment on concrete 
slabs attached directly to Manhattan bedrock. It 
experiences no problems from vibration now, 
says executive director Willa Appel. If there 
is no solution in London for the Crick, Appel 
says, “tell them they're welcome to come here”.


ARTIFICIAL INTELLIGENCE


DeepMind algorithm beats 
people at classic video games 


Computer that learns from experience provides a way to investigate human intelligence. 


BY ELIZABETH GIBNEY 


DeepMind, the Google-owned artificial-intelligence
company, has revealed how
it created a single computer algorithm
that can learn how to play 49 different arcade 
games, including the 1970s classics Pong and 
Space Invaders. In more than half of those 
games, the computer became skilled enough to 
beat a professional human player. 

The algorithm — which has generated a buzz
since publication of a preliminary version in
2013 (V. Mnih et al. Preprint at http://arxiv.org/abs/1312.5602;
2013) — is the first artificial-intelligence
(AI) system that can learn a variety
of tasks from scratch given only the same, minimal
starting information. “The fact that you
have one system that can learn several games,
without any tweaking from game to game, is
surprising and pretty impressive,” says Nathan
Sprague, a machine-learning scientist at James 
Madison University in Harrisonburg, Virginia. 
DeepMind, which is based in London, 
says that the brain-inspired system could also 
provide insights into human intelligence. 
“Neuroscientists are studying intelligence 
and decision-making, and here’s a very clean 
test bed for those ideas,” says Demis Hassabis, 
co-founder of DeepMind. He and his colleagues 
describe the gaming algorithm in a paper published
this week (V. Mnih et al. Nature 518, 529; 2015).

Games are to AI researchers what fruit
flies are to biology — a stripped-back system 
in which to test theories, says Richard Sutton, 
a computer scientist who studies reinforce- 
ment learning at the University of Alberta in 
Edmonton, Canada. “Understanding the mind 
is an incredibly difficult problem, but games 
allow you to break it down into parts that you 
can study,” he says. But so far, most human-beating
computers — such as IBM’s Deep
Blue, which beat chess world champion Garry 
Kasparov in 1997, and the recently unveiled 
algorithm that plays Texas Hold ’Em poker
essentially perfectly (see Nature http://doi.org/2dw;
2015) — excel at only one game.
DeepMind’s versatility comes from joining 
two types of machine learning — an achieve- 
ment that Sutton calls “a big deal”. The first, 
called deep learning, uses a brain-inspired 
architecture in which connections between
layers of simulated neurons are strengthened 
on the basis of experience. Deep-learning sys- 
tems can then draw complex information from 
reams of unstructured data (see Nature 505, 
146-148; 2014). Google, of Mountain View, 
California, uses such algorithms to automati- 
cally classify photographs and aims to use 
them for machine translation. 

The second is reinforcement learning, a 
decision-making system inspired by the neuro- 
transmitter dopamine reward system in the 
animal brain. Using only the screen’s pixels 
and game score as input, the algorithm learned 
by trial and error which actions — such as go 
left, go right or fire — to take at any given time 
to bring the greatest rewards. After spending 
several hours on each game, it mastered a range 
of arcade classics, including car racing, boxing 
and Space Invaders. 
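
The trial-and-error idea described above can be sketched in a few lines of code. This is a minimal, hypothetical illustration using tabular Q-learning on a toy five-square corridor; it is not DeepMind's published method, which pairs this kind of learning with a deep neural network that reads screen pixels. The environment, the reward values and the settings (alpha, gamma, epsilon) are all assumptions made for the example.

```python
import random

# Toy corridor: squares 0..4; reaching square 4 ends the episode with reward 1.
N_STATES = 5
ACTIONS = (-1, +1)  # go left, go right


def step(state, action):
    """Move within the corridor; return (next_state, reward, done)."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    """Learn action values by trial and error (tabular Q-learning)."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action_index]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Mostly take the best-known action; explore at random
            # occasionally, or when the two estimates are tied.
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                a = rng.randrange(2)
            else:
                a = 0 if q[state][0] > q[state][1] else 1
            next_state, reward, done = step(state, ACTIONS[a])
            # Nudge the estimate towards reward plus discounted future value.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][a] += alpha * (target - q[state][a])
            state = next_state
    return q


def greedy_policy(q):
    """Best action (-1 or +1) for each non-terminal square."""
    return [ACTIONS[0] if row[0] > row[1] else ACTIONS[1] for row in q[:-1]]


q_values = train()
print(greedy_policy(q_values))
```

Given only the reward signal, the learner in this toy setting comes to prefer the move that leads towards the goal from every square. The score plays the same role in the arcade games, although the state there is a screen of pixels rather than a position in a corridor.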

Companies such as Google have an imme- 
diate business interest in improving AI, says 
Sutton. Applications could include how to best 
place advertisements online or how to prioritize 


stories in news aggregators, he says. Sprague, 
meanwhile, suggests that the technique could 
enable robots to solve problems by interacting 
with their environments. 

But a major driver is science itself, says Hassabis,
because building smarter systems means gaining a greater
understanding of intelligence. Many in computational
neuroscience agree. Sprague, who has created his own version
of DeepMind’s algorithm, explains that whereas AI is largely
irrelevant to neuroscience at the level of anatomical
connections among neurons, it can bring insight
at the higher level of computational principles.
Computer scientist Ilya Kuzovkin at the Uni- 
versity of Tartu in Estonia, who is part of a team 
that has been reverse-engineering DeepMind’s 
code since 2013, says: “The tricks we use for 
training a system are not biologically realistic. 
But comparing the two might lead to new ideas
about the brain.” A particular boost is likely
to come from the DeepMind team’s choice to
publish its code alongside its research, Kuzovkin
says, because his lab and others can now build
on top of the result. “It also shows that industry-financed
research goes the right way: they share
with academia,” he adds.

DeepMind was bought by Google in 2014 for 
a reported £400 million (US$617 million), and 
has been poaching leading computer scientists 
and neuroscientists from academia, growing 
from 80 to 140 researchers so far. 

Its next steps are again likely to be influenced 
by neuroscience. One project could be building 
a memory into its algorithm, allowing the sys- 
tem to transfer its learning to new tasks. Unlike 
humans, when the current system masters one 
game, it is no better at tackling the next. 

Another challenge is to mimic the brain’s 
way of breaking problems down into smaller 
tasks. Currently, DeepMind’s system struggles 
to link actions with distant consequences — a 
limitation that, for example, prevented it from 
mastering maze games such as Ms. Pac-Man.


LOWELL OBSERVATORY 


Researchers seek definition 
of head-trauma disorder 


Guidelines should assist in diagnosis of brain disease seen in retired American footballers. 


BY HELEN SHEN 


Dave Duerson suspected that something
was wrong with his brain. By 2011,
18 years after the former American
football player had retired from the Phoenix 
Cardinals, he experienced frequent headaches, 
memory problems and an increasingly short 
temper. Before he killed himself, he asked that 
his brain be donated for study. 

Researchers who examined it found signs 
of chronic traumatic encephalopathy (CTE), 
a degenerative condition linked to repeated 
head injuries. At least 69 cases have been 
reported in the literature since 2000, many in
former boxers and American football players 
(P.H. Montenigro et al. Alz. Res. Ther. 6, 68; 
2014) — heightening public concern about 
concussions during contact sports. Yet much 
about CTE is unknown, from its frequency to 
its precise risk factors and even whether its 
pathology is unique. 

Researchers now hope to take a major step 
towards answering those questions. At Boston 
University in Massachusetts on 25-27 Febru- 
ary, neuroscientists will convene to examine 
the characteristics of CTE in brain tissue from 
post-mortem examinations. They hope to agree 
on a set of diagnostic criteria for the disease, and
to assess whether it is distinct from other brain 
disorders, such as Alzheimer’s disease. 

The effort is sorely needed, says Walter 
Koroshetz, acting director of the US National 
Institute of Neurological Disorders and Stroke 
in Bethesda, Maryland, which is organizing the 
meeting. “The definition is the important piece 
that lets you do the rest of the research,” he
says. And the stakes are high. CTE is associated 
with memory loss, irritability, depression and 
explosive anger, which are thought to appear 
and worsen years after repeated head trauma. 
Research by Ann McKee, a neuropathologist
at Boston University, suggests that CTE often
occurs alongside Alzheimer’s disease and other
neurodegenerative disorders. “I think this
disease is more common than traditionally
recognized,” she says.

Repeated head injuries in American football have been linked to a degenerative brain disorder later in life.

Such research helped to spur a major lawsuit 
against the US National Football League (NFL) 
over its handling of head trauma, involving 
some 4,500 former players and a potential settle- 
ment of more than US$765 million. The find- 
ings have led researchers worldwide to examine 
the risks of head trauma in other sports, includ- 
ing professional rugby and amateur soccer. 

To make CTE studies more systematic, 
McKee has proposed pathological criteria and 
disease stages defined by the severity of a person’s
symptoms (A.C. McKee et al. Brain http://doi.org/2dc;
2012). CTE often involves deterioration
of white matter in the brain, as well as
abnormal aggregation of a DNA-binding pro- 
tein called TDP-43 — which is also implicated 
in amyotrophic lateral sclerosis. But McKee says 
that CTE’s most striking feature is the clumped 
distribution of an abnormal, hyperphosphoryl- 
ated form of a protein called tau that normally 
helps to stabilize the internal structure of cells. 

Tau deposits are found in the brains of 
elderly people and in those with Alzheimer’s 
disease — leading some sceptics to question 
whether CTE is a distinct disorder. And crit- 
ics such as Christopher Randolph, a neuropsy- 
chologist at Loyola University Medical Center 
in Maywood, Illinois, worry that CTE research 
has a basic sampling bias. The brain banks that 
support many CTE studies largely rely on tissue 
donated by players or their families, who often 
suspect that the individual had a severe neuro- 
logical problem. This tips the balance towards 
those who may have various brain disorders. 

Mistaking other conditions for CTE could 
have serious consequences, says Randolph, who 
worked with the Chicago Bears NFL team until 
2002. “If you are in the grips of major depres- 
sion, your actions may be different if you believe 


you have a fatal neurodegenerative disease and 
you’re just going to get worse, versus believing
that you have a treatable illness,” he says.

McKee acknowledges the sampling bias 
but says that acquiring donated tissue from 
healthy individuals will take time, because 
relatively few people request post-mortems in 
the absence of a visible problem. But she is con- 
vinced that many of the athletes’ brains that she 
has received show an unmistakable signature 
that includes concentrations of hyperphospho- 
rylated tau around blood vessels, especially in 
the furrows of the cerebral cortex. This is a 
departure, she says, from the more-uniform tau 
deposition seen in diseases such as Alzheimer’s. 

At the meeting, researchers will put these 
potential CTE signatures to the test. McKee and 
her team have prepared images of tissue slices 
from 25 brains from multiple brain banks. The 
sample, from players and non-players, includes 
suspected cases of CTE and known cases 
of Alzheimer’s disease and other tau-based 
disorders. Neuropathologists will evaluate the 
images without information on the subjects’ 
personal or medical histories, and will present 
their blinded diagnoses at the event. 

If reliable neuropathological criteria can be 
found for CTE, researchers hope to use them 
to re-examine medical records from other 
deceased patients. Currently, many of the 
psychological symptoms associated with CTE 
are difficult to distinguish from other disorders, 
and brain-imaging tests for CTE are still in early 
development. Universal criteria for detecting 
CTE in donated brains could help scientists to 
search for other signatures so that physicians 
can diagnose the disease in living people.


CORRECTION 

The News story ‘Language origin debate 
rekindled’ (Nature 518, 284-285; 2015) 
misspelt the name of Paul Heggarty. 
DWARF PLANETS: A TALE OF TWO MISSIONS

CALL IT THE YEAR of the dwarf planet. In 2015,
scientists will get their first close-up look at two of the
Solar System's biggest little rocks. The Dawn mission
will fly past Ceres, in the asteroid belt between Mars and
Jupiter, whereas New Horizons will encounter Pluto, the
infamous ex-planet that orbits the icy reaches beyond
Neptune. They promise to reveal surprises that could
redefine how astronomers think of these small bodies.

DAWN TO CERES
NASA's probe will analyse the
largest unexplored objects in
the inner Solar System.
LAUNCH: 27 SEPTEMBER 2007
TARGET: ASTEROID BELT

NEW HORIZONS TO PLUTO
NASA's mission to the far reaches
of the Solar System will gather
data on a distant dwarf planet.
LAUNCH: 19 JANUARY 2006
TARGET: KUIPER BELT

Labelled on the spacecraft diagram: framing camera; visible and
infrared spectrometer; gamma ray and neutron detector; ion thruster.

GOING THE DISTANCE
The probes have been in space for
similar lengths of time, but have
covered vastly different distances.
Dawn milestones: Mars fly-by; Vesta orbit (16 July 2011 to
5 Sep 2012); Ceres orbit (enters gravitational pull).
DIFFERENT JOURNEYS

Dawn travelled using a mixture of
thrusting and cruising on its way to the
asteroid belt, whereas New Horizons
blasted nearly directly outwards to Pluto.

New Horizons’ speed received
a boost of 14,400 km h⁻¹
from Jupiter to counter the
gravitational pull of the Sun.


WHAT'S IN A NAME? 


There is no faster way to trigger an argument among Solar System 
researchers than to bring up the definition of a planet. For decades, 
Pluto was considered the ninth planet. But in 2006, prompted by 
the discovery of other large Kuiper belt objects, the International 
Astronomical Union redefined what it means to be a planet. Pluto 
was declassified because it has not gravitationally cleared its orbit 
of other large bodies. Instead, Pluto and Ceres now belong to the 
newly created category of dwarf planets, which are allowed to orbit 
in a zone containing similar objects. 


BODIES OF INTEREST

Both Ceres and Pluto are dwarf planets, but
at first glance they have little in common.

Vesta
A rocky asteroid with a huge crater at its south pole

Ceres
Diameter: 950 km
Day length: 9 Earth hours
Atmosphere: […] watery
Temperature: −140 °C to −70 °C
Surface: Rocky, probably with buried ice
Key Ceres facts:
• Discovered in 1801 by Giuseppe Piazzi
• Largest object in the asteroid belt


Questions:
• How much of it is water?
• Was it once habitable?


BY ALEXANDRA WITZE / ILLUSTRATION BY NIK SPENCER 


Charon
Largest moon of Pluto
Diameter: 1,200 km

Binary planets
Charon is so large compared to Pluto
that the two both orbit a mutual
centre of gravity, rather than one
orbiting the other.


Key Pluto facts:
• Discovered in 1930 by Clyde Tombaugh
• First known Kuiper belt object


Questions:
• What does its icy surface look like?
• Was it ever geologically active?


Pluto
Diameter: 2,320 km
Day length: 6.4 Earth days
Atmosphere: Tenuous, probably replenished by
ices sublimating from the surface
Temperature: −240 °C to −220 °C
Surface: Icy, with methane, nitrogen,
carbon monoxide
The Pluto siblings


Leslie and Eliot Young have
spent their lives studying
Pluto. Now they are gearing
up for the biggest event of
their careers.


BY ALEXANDRA WITZE


In a spare conference room in Boulder, Colorado,
planetary scientists Leslie and Eliot Young
quiz a graduate student to prepare him for
his upcoming exams. They take their task
seriously, interrupting often as he answers
questions about Pluto and Neptune’s moon Triton.

Leslie makes a technical comment about the
light reflecting off those distant worlds. Then,
Eliot notes that Pluto and Triton may have
started out very similar to one another in the
early Solar System before evolving down
different paths. “It’s a classic case of nature versus
nurture,” he says. “They are siblings.”

So, too, are the Youngs. Eliot and Leslie
grew up as the oldest children of an
astronautics researcher, and their mutual interests
converged on one dwarf planet. “They are the
only brother-sister Pluto team in the Solar
System,” says Alan Stern, a planetary scientist and
principal investigator of NASA’s New Horizons
mission, which has been hurtling towards
Pluto for the past nine years.


The Youngs and other Pluto researchers 
will be gearing up over the next few months as 
New Horizons finally nears its quarry, 4.8 bil- 
lion kilometres from Earth. A telescope on the 
spacecraft has already begun capturing fuzzy 
pictures of Pluto, which will grow sharper as the 
probe closes in. And when New Horizons passes 
within 12,500 kilometres of Pluto on 14 July, it 
will provide the first close-up look at the world’s 
icy surface, and the best chance yet to answer 
major questions about the evolution of the outer 
Solar System (see page 468). 

The fly-by will mark a major milestone in 
both the personal and the professional lives 
of the Youngs, who occupy adjoining offices 
at the Southwest Research Institute in Boul- 
der. Over the past quarter of a century, their 
careers have intersected with Pluto science at 
key points, from helping to discover the dwarf 
planet’s atmosphere to making some of the first 
detailed maps of its enigmatic surface. What- 
ever New Horizons finds this year will build in 
large part on work done by the siblings. 

“We've had some ideas about how Pluto 
works for decades now,” says Eliot. “We'll 
finally find out if they are right.” 


FAMILY ORBIT 
When Eliot and Leslie were growing up in 
Newton, Massachusetts, family life revolved 
around their father, Larry Young, a legendary 
researcher at the Massachusetts Institute of 
Technology (MIT) in Cambridge. Young spe- 
cializes in the biological effects of weightless- 
ness, and he trained to fly on the space shuttle 
although he never went into orbit. Eliot, Leslie 
and their younger brother, Robert, sometimes 
played poker with visiting astronauts. 

Larry Young was also a passionate skier who
studied skiing injuries. On most winter weekends,
his family took a long car trip to New
Hampshire that was filled with brain games and 
chatter about mathematics. 

Larry is not surprised that his two oldest 
children pursued science, but he never imag- 
ined them both studying the same dwarf planet 
in the distant reaches of the Solar System. “I 
think it’s the closest they could get to doing 
science fiction and still earn a living,” he says.
(Robert ended up in software development.) 

Eliot felt the pull of Pluto first. As a gradu- 
ate student at MIT in the late 1980s, he worked 
with Jim Elliot and Ted Dunham, who were 
building instruments for airborne astronomy 
missions, including ones to study Pluto. But his 
sister, three years his junior, quickly followed. 
One day, she stopped in at his lab to show him 
a piece of computer coding she had done. Even 
though she was still an undergraduate at nearby 
Harvard University, Jim Elliot was impressed 
enough to offer her a job working on software. 

The MIT team specialized in studying dis- 
tant worlds using stellar occultations — when 
an object of interest moves between Earth and 
a background star. By measuring how much
the light dims, planetary scientists can determine
the size of the blocking object. And by noting
whether the light dims
abruptly or gradually, they can deduce whether 
that object has an atmosphere. Pluto is so small 
(about two-thirds the size of Earth's Moon) and 
far away (between about 30 and 50 times farther 
from the Sun than Earth is) that astronomers 
need to use every creative technique they can 
think of to tease out information. 
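
The atmosphere test described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the two light curves are synthetic, the names `ingress_duration` and `looks_like_atmosphere` and the one-second cut-off are hypothetical, and a real occultation analysis would also have to account for effects such as the speed of the shadow and diffraction.

```python
import math


def ingress_duration(times, flux, hi=0.9, lo=0.1):
    """Seconds the star takes to fade from `hi` to `lo` of full brightness.

    `flux` is assumed to be normalised: 1.0 = unobstructed star, 0.0 = fully blocked.
    """
    t_hi = None
    for t, f in zip(times, flux):
        if t_hi is None and f <= hi:
            t_hi = t  # first sample at or below the upper level
        if f <= lo:
            return t - t_hi  # first sample at or below the lower level
    raise ValueError("light curve never dims below the lower level")


def looks_like_atmosphere(times, flux, abrupt_limit_s=1.0):
    """Gradual fading hints at an atmosphere; the threshold is an assumed value."""
    return ingress_duration(times, flux) > abrupt_limit_s


# Synthetic light curves sampled every 0.1 s (illustrative only, not real data).
times = [i * 0.1 for i in range(400)]
airless = [1.0 if t < 20.0 else 0.0 for t in times]  # star winks out at once
hazy = [1.0 / (1.0 + math.exp((t - 20.0) / 2.0)) for t in times]  # smooth fade

print(looks_like_atmosphere(times, airless))
print(looks_like_atmosphere(times, hazy))
```

The sharp-edged curve fades within a single sample, whereas the smooth curve takes several seconds, which is the kind of gradual dimming the text associates with an atmosphere.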


SKYWATCH 

One night in June 1988, several members of the 
MIT group took off from Honolulu, Hawaii, in 
the Kuiper Airborne Observatory, a telescope- 
carrying plane that flew above the obscuring 
effects of Earth’s atmosphere. Astronomers at 
the time suspected that Pluto had an atmos- 
phere, but no one had ever spotted it. 

Leslie Young was not yet a graduate student, 
but she was on the plane to help with the meas- 
urements. She distinctly remembers the excite- 
ment as the star’s light dimmed gradually, and 
Pluto’s long-sought atmosphere was revealed¹.
The discovery, supported by ground-based 
measurements of the same occultation, made 
front-page headlines. When her brother Rob- 
ert asked how it felt to have rewritten the text- 
books, “I told him it felt pretty good,” she says. 

Pluto remained a fuzzy dot on a map of the
outer Solar System, even as other distant planets
were coming into focus. NASA’s Voyager 2
spacecraft had visited Uranus in 1986, and
three years later it swept past Neptune but did
not go near Pluto. Researchers knew little about
that world, other than that its surface seemed
to be mostly ice, rather than rock, and it was
accompanied by a moon, Charon, which was
half the size of the dwarf planet itself. (Pluto was
demoted from planet status in 2006.)

Leslie and Eliot Young at the Sommers-Bausch
Observatory in Boulder, Colorado.

Helped by her keen coding skills, Leslie went
on to make a series of major discoveries as part
of the MIT team, including spotting methane
in Pluto’s atmosphere² and nitrogen ice on
its surface³. “If somebody asks me what’s my
favourite colour,” she says, “I say 2.15 microns”
— the wavelength of the light absorbed by the
frozen nitrogen on Pluto’s surface.

In the years that followed, Leslie developed 
computer models to describe how the surface 
and atmosphere of Pluto interact. Because the 
orbit of the dwarf planet is extremely stretched 
out in an elongated ellipse, the amount of 
sunlight reaching its surface changes mark- 
edly throughout the Pluto year, which lasts 
248 Earth years. When Pluto’s orbit carries it
closer to the Sun, methane, nitrogen and other 
ices on the surface sublimate and form a tenu- 
ous atmosphere, roughly one millionth the 
thickness of Earth’s. Some researchers argue 
that as Pluto gets farther away from the Sun in 
the coming years, the gases in the atmosphere 
will refreeze and drop to the surface, although
Leslie's latest models suggest that the atmosphere
never completely disappears⁴.
Occultation studies⁵,⁶ indicate that the den-
sity of Pluto’s atmosphere doubled between 
1988 and 2002 and has stayed pretty much 
constant since then. So one of New Horizons’ 
major goals at Pluto is to unravel the icy inter- 
play between the surface and the atmosphere. 


FACE OF PLUTO 

The big brother who got Leslie into Pluto has 
made his own mark. During and after his gradu- 
ate studies, Eliot worked to map the face of Pluto 
by taking advantage of a geometric coincidence.
Between 1985 and 1990, the orbital planes of 
Pluto and Charon tilted such that the two 
worlds regularly passed in front of one another 
as seen from Earth, allowing astronomers to 
watch a series of mutual eclipses. By measuring 
how Pluto's face dimmed as sections of it disap- 
peared from view, researchers could work out 
which areas were dark and which were light — a 
property called albedo. Eliot was one of several 
scientists piecing together these mosaics to produce
maps of the dwarf planet’s surface⁷.

The maps were far from perfect: “The resolu- 
tion is like somebody with a strong glasses pre- 
scription getting drunk and going to look at the 
Moon,” says Eliot. But they provided some of
the first real knowledge about what Pluto might 
look like. “It was foundational information 
that got people excited about Pluto,” says Marc 
Buie, a Pluto astronomer now at the Southwest 
Research Institute. He competed with Eliot to 
generate the Pluto maps, and gives him credit 
for inventiveness. “He came up with some ways 
of tackling the data that I never would have 
thought of in a million years,” says Buie.

Since then, the Hubble Space Telescope has 
managed to make sharper images of Pluto's 
surface. By around May this year, New Hori- 
zons will be close enough to Pluto to capture 
images better than Hubble’s, and the dwarf 
planet will finally begin to come into focus. 
In the highest-resolution pictures, scientists 
should be able to pick out details as small as 
the lakes in New York City’s Central Park. 

Eliot is perhaps best known for his mapping 
work, but says that his most useful contribu- 
tion to Pluto science is a method for modelling 
occultation light curves. He is happiest at the 
interface between technology and space science,
especially if a healthy dose of hardware is
involved. “I may still be an engineer at heart,” he
says. Dunham, now at Lowell Observatory in 
Flagstaff, Arizona, remembers Eliot building his 
own computers and housing them in cardboard 
boxes while in graduate school. 

In recent years, Eliot has spent less time on 
Pluto and more on pushing technical bounda- 
ries in another frontier of Solar System science 
— sending balloons above Earth’s atmosphere 
to make planetary observations. His propen- 
sity for technical tinkering has served him in 
his favourite hobby, too. He developed a new
timing system for races at a ski resort near
Boulder, where he coaches a team. 

Eliot and Leslie’s mother, Jody Williams, is
not surprised that the siblings have ended up
working closely together. During their childhood
trips to New Hampshire, where they had
no television or other distractions, Eliot and
Leslie would play for hours, building small
towns and fantasy worlds with shared rules.
Now, Williams says, “every time Eliot runs into
a problem, he calls Leslie, and she never lets
him down”.

Eliot and Leslie with their parents in 1967.

Today, Eliot retains a measure of his big- 
brother status: he is the gregarious one who 
often speaks for both of them and is known 
more widely among planetary astronomers. 
Leslie, more reserved, sometimes gets noticed 
initially because she is his sibling. “They learn
I’m his sister and they figure I’m worth
listening to,” she says.

Yet the siblings clearly revel in their close 
relationship. Both usually work some men- 
tion of each other into professional discussions 
within minutes, and they constantly ricochet 
ideas between each other's offices. “They know 
each other better than any of us know either of 
them,” says Stern. 


FLY-BY FRENZY 

As Pluto scientists approach peak excitement 
this year, Eliot will be helping to coordinate 
a cadre of amateur astronomers across the 
Southern Hemisphere who will try to capture 
a major occultation as Pluto passes in front of 
a particularly bright star on 29 June. It is the 
last of these events before the New Horizons 
fly-by, and a crucial data point in the series of 
occultation studies dating back to the mid- 
1980s — the only long-term measurements of 
how Pluto’s atmosphere has changed.

Leslie will move to mission control at the 
Johns Hopkins University Applied Physics 
Laboratory in Laurel, Maryland, to prepare 
for intense data gathering. She is a deputy pro- 
ject scientist for New Horizons and the chief 
architect of the details of the fly-by, a job that 
involves a steady diet of teleconferences and 
spreadsheets to coordinate what observation 
will be made by what instrument at what time. 
The mission team has carried out several dry 
runs of the entire encounter, trying to antici- 
pate every possibility in what will be an action- 
packed few days leading up to and after the 
closest approach. “I’ve been working for the 
future for 15 years, and the pay-off is coming 
this summer,” says Leslie.

Among other things, she has developed 
alternative trajectories for the spacecraft to 
divert to if it seems to be heading for a particu- 
larly dusty patch of space. With the spacecraft 
moving at nearly 50,000 kilometres per hour, 
collisions with dust particles could endanger it 
or damage instruments. 

Cathy Olkin at the Southwest Research Insti- 
tute, who was part of Jim Elliot's MIT group a 
few years after Leslie and is another New Horizons
deputy project scientist, says that observing
occultations was good training for making
high-stress, time-crucial measurements. She 
and Leslie have chased Pluto’s shadow across 
islands in the Pacific and during snowstorms in 
New Zealand. “We know the value of telescope 
time and being prepared and having thought 
out what we're going to do at each step in time,” 
says Olkin. “We know we have to get the data.” 

Back in her office, Leslie Young takes down 
her 1978 edition of David Halliday and Robert 
Resnick’s Fundamentals of Physics textbook. 
She opens it to the back, to the reference table 
listing characteristics of Solar System objects. 
For Mercury, Venus, Earth and the rest of the 
planets, the table looks reassuringly full. For 
Pluto, the data column is incomplete. 

One-third of the entries are question marks. 
Many of the rest are flat-out wrong. “Moons:
none,” it reads. (Charon had yet to be discovered.)
“Atmosphere: none.” (Ditto.)

Leslie runs her finger down the column, 
ticking off the Pluto discoveries that she and 
her big brother have been involved in. Atmos- 
phere, radius, albedo, surface temperature 
— all key to understanding this curious little 
world. In July, she and Eliot hope to help fill in 
many of the remaining question marks. And 
maybe even add some new ones.


Alexandra Witze is a correspondent for 
Nature based in Boulder, Colorado. 


1. Elliot, J. L. et al. Icarus 77, 148-170 (1989).
2. Young, L. A. et al. Icarus 127, 258-262 (1997).
3. Owen, T. C. et al. Science 261, 745-748 (1993).
4. Olkin, C. B. et al. Icarus 246, 220-225 (2015).
5. Elliot, J. L. et al. Nature 424, 165-168 (2003).
6. Sicardy, B. et al. Nature 424, 168-170 (2003).
7. Young, E. F. & Binzel, R. P. Icarus 102, 134-149 (1993).




THE PAINFUL TRUTH 


Brain-scanning techniques promise to give 
an objective measure of whether someone 
is in pain, but researchers question whether 
they are reliable enough for the courtroom. 


BY SARA REARDON 


Annie is lying down when she answers the phone; she is trying
to recover from a rare trip out of the house. Moving around for
an extended period leaves the 56-year-old exhausted and with
excruciating pain shooting up her back to her shoulders. “It’s
really awful,” she says. “You never get comfortable.”

In 2011, Annie, whose name has been changed at the request of her 
lawyer, slipped and fell on a wet floor in a restaurant, injuring her back and 
head. The pain has never eased, and forced her to leave her job in retail. 

Annie sued the restaurant, which has denied liability, for several
hundred thousand dollars to cover medical bills and lost income. To
bolster her case that she is in pain and not just malingering, Annie’s 
lawyer suggested that she enlist the services of Millennium Magnetic 
Technologies (MMT), a Connecticut-based neuroimaging company 
that has a centre in Birmingham, Alabama, where Annie lives. MMT 
says that it can detect pain’s signature using functional magnetic reso- 
nance imaging (fMRI), which measures and maps blood flow in the 
brain as a proxy for neural activity. 

The scan is not cheap — about US$4,500 — but Steven Levy, MMT’s 
chief executive, says that it is a worthwhile investment: the company has 
had ten or so customers since it began 
offering the service in 2013, and all have 
settled out of court, he says. If the scans are
admitted to Annie’s trial, which is expected 
to take place early this year, it could estab- 
lish a legal precedent in Alabama. 

Most personal-injury cases settle out 
of court, so it is impossible to document 
how often brain scans for pain are being 
used in civil law. But the practice seems 
to be getting more common, at least in 
the United States, where health care is 
not covered by the government and per- 
sonal-injury cases are frequent. Several 
companies have cropped up, and at least 
one university has offered the service. 

The approach is based on burgeoning 
research that uses fMRI to understand the nature of pain — a very
subjective experience. Scientists hope that the scans can provide an objec- 
tive measure of that experience, and they see potential applications, such 
as in testing painkillers. But many neuroscientists say that the techniques 
are still far from being accurate enough for the courtroom. Critics say 
that the companies using them have not validated their tests or proved 
that they are impervious to deception or bias. And whereas some think 
the technologies will have a place in legal settings, others worry that the 
practice will lead to misuse of the scans. 

“There's a real desire to come up with some more-objective proxy for 
pain,” says Karen Davis, a neuroscientist at the University of Toronto in
Canada. But such measures must be extremely accurate, she says. “The 
outcome of having a wrong answer can be quite catastrophic.” 


NEURAL ORIGINS 

The methods that doctors commonly use to assess pain can seem crude. 
People are asked to rate their pain on a scale from one to ten, or choose 
from a row of cartoon faces that go from happy to anguished. These 
measures can help to chart changes in pain, as someone recovers from 
surgery, for example. But each person will experience and rate their pain 
differently, so one person's five could be worse than another's seven, and a 
nine might or might not be bad enough to keep someone from working. 

An objective answer should lie in the brain, where the experience of 
pain is ultimately constructed. And although every experience is dif- 
ferent, pain should share some common elements. Neuroscientist Tor 
Wager at the University of Colorado Boulder has been trying to decipher 
pain’s signature in the brain by placing people in an fMRI scanner while
they touch a hot plate. As the researchers turn the plate’s temperature 
up and down, they record the activity across different parts of the brain, 
including the sensory regions associated with the hand. From these 
patterns, Wager says, they can predict with better than 90% accuracy 
whether the plate is just warm or painfully hot¹.

But this measures acute pain — the immediate response to an obvious 
stimulus. Chronic pain, like Annie’s, affects hundreds of millions of peo- 
ple worldwide. And although its cause can be obvious, that is not always 
the case. Vania Apkarian of Northwestern University in Chicago, Illinois,
has scanned dozens of individuals soon after a back injury and then 
again over the course of a year or more. The pain went on to become 
chronic in roughly half of those people, and even though they described
the pain the same way throughout, Apkarian could detect a shift in the
pain signature in their brains². It changed from a signal of activity in
the insula, which is associated with acute pain, to one of activity in the 
medial prefrontal cortex, which processes cognitive behaviour, and the 
amygdala, which controls emotion. “Our interpretation is that the pain 
is becoming more internalized,” Apkarian says.

This and other work suggests that there is an emotional component 
to chronic pain that is not necessarily involved in acute pain. Chronic 
pain and depression often coexist and reinforce one another. And some 
chronic pain can be eased with antidepressant drugs. But Wager cau- 
tions that focusing on these links can be 
treacherous. Suggesting that pain is all in 
the head — even if that is technically the 
case — does not mean that it is imagined 
or faked. “People will always go to that 
black and white line,” he says. 

That line is a particular challenge in 
legal settings. “A person cannot be found 
disabled based on pain unless they can 
point to a specific cause,” says Amanda 
Pustilnik, a legal expert at Harvard Law 
School in Cambridge, Massachusetts. 


ISOLATED INSTANCES 

The United States sees tens of thousands 
of injury lawsuits every year, most of 
which involve claims of unresolved pain. 
But that might be unusually high — countries with national health sys- 
tems, such as Canada, see fewer lawsuits, says Davis. So far, the only pain 
case involving brain-imaging techniques known to have progressed to 
trial involved a truck driver named Carl Koch, whose wrist was burned 
by a glob of molten asphalt in 2005. A year later, he said he was still 
in pain and sued his former employer, Western Emulsions in Tucson, 
Arizona, for damages. 

Koch had had his brain scanned by Joy Hirsch, a neuroscientist who 
was running the fMRI Research Center at Columbia University in New
York City. Hirsch had developed a method that she says can “tap into” 
chronic pain. Lightly touching the affected wrist provoked a signal in 
sensory regions and other brain areas associated with pain; touching 
the other wrist did not. The test, she says, is a well-characterized way 
to distinguish allodynia — a pain response to a stimulus that does not 
normally cause pain — from imagined pain. 

At the trial, Western Emulsions called Sean Mackey, a neurologist at 
Stanford University in Redwood City, California, as an expert witness. 
Mackey maintained that pain is too subjective to measure in this way 
and that the signature Hirsch was detecting could have been produced 
if Koch had expected to feel pain in the affected wrist or was unduly 
concentrating on it — deliberately or not. Hirsch argued that there are 
known signals for imagined pain that were not apparent in the scans. 

Ultimately, the judge admitted the scan, and the case settled for 
$800,000 — more than ten times the company’s initial offering, accord- 
ing to Koch's lawyer, Roger Strassburg. 

Another issue, Mackey says, is that it might be possible for people to 
cheat the test. In a 2005 study, he instructed volunteers to lie in an fMRI
scanner and touch a hot plate while he showed them a video of flames 
that became more or less intense on the basis of their brain activity. Given 
this visual feedback, volunteers were able to control the intensity of the 
flames by imagining the pain as being more or less severe than it actually 
was³. Mackey is looking into the technique as a way to control chronic
pain, but he is also studying whether people can trick the scanner. 

After the Koch case, the use of such techniques began to pick up. 
Hirsch, who is now at Yale University in New Haven, Connecticut, says 
that while she was at Columbia, she had been doing two to three pain- 
related scans per month, many of which were to support lawsuits. She is 
hoping to offer the service at Yale. 

A main criticism of the various techniques being used in civil suits is 
the paucity of publications to validate them. Hirsch has not published
anything on her method, but says that she does not think it is necessary. 
The way in which different body parts are represented in the brain has 
been well mapped, she says, and the scans she has done provide no 
further insight than answering whether or not the person was in pain. 

MMT takes a somewhat different approach: it compares scans before 
and after an individual engages in a painful activity. For example, Annie 
was scanned before and after walking around, and the company claimed 
that it could detect a clear pain signal in the second scan. But the com- 
pany’s only publication, led by co-founder and chief science officer Don- 
ald Marks, has been a single case study. After the person did something 
painful, a brain scan revealed particularly strong activity in the insula,
which is involved in consciousness and self-regulation, and the somatosensory
cortex, which processes sensations from the various parts of the body⁴.

These regions are involved in pain, 
but they are also involved in many other 
things. “If you went to a Society for Neuro- 
science meeting and walked into any non- 
pain-related slide session, you’d see the
same regions being talked about,” Davis
says. Getting a patient such as Annie to 
walk around between scans would not 
only cause her pain, but also increase her 
awareness of her back, which would acti- 
vate the insula. Davis, who does not think 
that pain imaging should be used in court 
for this purpose, says that she finds it dis- 
turbing that Marks’s study cites her work, 
which measured a different kind of brain 
activity. “It's quite shocking for them to be 
quoting studies that don’t back up their 
technology at all,” she says. 

Moreover, the test cannot be validated in
a single person, Wager says. Any number of
confounding factors — emotion, expectation,
or head movement in the scanner, for instance — could account
for the signals the company sees. To prove that the method is valid, the 
researchers would have to show that the signals differ between people in 
pain and controls, he says, and that there is a biological mechanism that 
accounts for the signal. Without that, “it’s like reading tea leaves”. 

Marks disputes this, saying that numerous studies, including Wager's, 
have shown that fMRI can reliably distinguish between pain states. “My 
work is an application on an individual basis of all the data to date which 
validates this approach,” Marks says. He also argues that the approach is
not meant to determine whether or not someone who says they are in
pain actually is. “I’m taking individuals that everyone agrees have pain
and providing a visual graphic representation of that pain.”


CLOSE TO MARKET 


Using different techniques, Chronic Pain Diagnostics (CPD) of 
Roseville, California, is planning to offer commercial scans for litigants. 
CPD compares scans taken ofa person’ brain after they received an 
electric shock to a database of images from 30 individuals with and 
without chronic pain. People with chronic pain respond to a stimu- 
lus differently from healthy controls, and the company has developed 
an algorithm that allows it to distinguish between the two with 92% 


WE DON'T WANT IT 10." 


has ways to control for outside factors that could affect its database, such as 
randomizing the order in which the patients are scanned and using people 
of different ages and genders. But he agrees that further experiments are 
needed to determine how well the algorithm works for individual patients. 
England says that the company hopes to start another study soon. 

Scientists’ concerns about the validity of pain scans might not matter 
much to legal professionals and the courts, says Michael Flomenhaft, an 
attorney in New York City who specializes in chronic pain and neuro- 
imaging. “There's a lot of scientific information that can't be stated with 
the level of certainty you'd need to present it at a scientific conference, 
but is confident and valuable in a legal setting.” 

There is, however, evidence that brain scans could be overly persuasive 
to jurors. Research has suggested that the general public is more likely to 


accept poor arguments if they are accom- 
panied by neuroscientific evidence®. In 
the Koch case, Mackey says, “pretty brain 
pictures ended up being very compelling”. 

The efforts to introduce pain imag- 
ing are similar in some ways to attempts 
over the past decade to use {MRI as a lie 
detector. Most researchers question the 
reliability of this technique. It is difficult 
to validate because study volunteers tend 
not to have the same motivations to lie 
as criminal defendants. But that has not 
stopped several companies from try- 
ing — thus far unsuccessfully — to have 
the evidence introduced in US courts. Pain 
imaging has been more successful owing 
to richer research on the topic. And the 
stakes are much lower for a civil case than 
in a criminal trial, so the bar for what constitutes
evidence is lower, according to
an analysis in the Journal of Law and the 
Biosciences’. 

But some scientists and ethicists are concerned
about where the increasing acceptance of pain imaging might
lead. Pustilnik worries that it could become a sort of pass-fail test, not just 
forcing litigants to provide proof of their pain, but potentially making it a
requirement to get prescription medications or insurance coverage. She is 
heading a working group at Harvard that is developing a list of ethical and 
scientific standards for the technologies before they become widespread. 

Levy and Marks insist that their technology is not capable of that. 
“Fundamentally, we can't prove that a patient does not have pain,” Levy 
says, because an individual might still be experiencing pain even if the 
scanner does not show it. 

But that situation may be inevitable, says Stuart Derbyshire, a 
neuroscientist at the National University of Singapore. “If we accept 


the logic that the brain imager knows, then we have to accept that it’s 
going to win even in cases when we don't want it to.” 


Even so, many say that the research should continue to strive for appli- 
cation, including inside the courtroom. “We already make many wrong 
treatment and legal decisions about who is and is not in pain and who 
shouldn’t be believed,” Wager says. “If we had new information, that
could help us do a better job.” ■ SEE EDITORIAL P.456


accuracy’. CPD president and co-founder Shaun England says that he 


expects a scan to cost between $5,000 and $6,000. 

Mackey says that the application is interesting and potentially useful if 
the technique is replicated in larger groups. But Apkarian says the sample 
size is too small to determine meaningful differences at this point. Just as 
Sara Reardon writes for Nature from Washington DC. 
Nurses at the Kenema hospital in Sierra Leone, which contributed data to early efforts to sequence the genome of the Ebola virus from the West Africa outbreak. 


Make outbreak research open access


Establish principles for rapid and responsible data sharing in epidemics, 
urge Nathan L. Yozwiak, Stephen F. Schaffner and Pardis C. Sabeti. 


Last April, five months into the
largest Ebola outbreak in history,
an international group of research- 
ers sequenced three viral genomes, sam- 
pled from patients in Guinea’. The data 
were made public that same month. Two 
months later, our group at the Broad 
Institute in Cambridge, Massachusetts, 


sequenced 99 more Ebola genomes, from 
patients at the Kenema Government 
Hospital in Sierra Leone. 

We immediately uploaded the data 
to the public database GenBank (see 
go.nature.com/aotpbk). Our priority was 
to help curb the outbreak. Colleagues who 
had worked with us for a decade were at the 
front lines and in immediate danger; some 
later died. We were amazed by the surge 
of collaboration that followed. Numerous 
experts from diverse disciplines, including 
drug and vaccine developers, contacted us. 
We also formed unexpected alliances — 
for instance, with a leading evolutionary 
virologist, who helped us to investigate 
when the strain of virus causing the 
current outbreak arose. 

The genomic data confirmed that the 
virus had spread from Guinea to Sierra 
Leone, and indicated that the outbreak 
was being sustained by human-to-human 
transmission, not contact with bats or some 
other carrier. They also suggested new 
probable routes of infection and, impor- 
tantly, revealed where and how fast muta- 
tions were occurring’. This information is 
crucial to designing effective diagnostics, 
vaccines and antibody-based therapies. 

What followed was three months of 
stasis, during which no new virus sequence
information was made public (see ‘Gaps 
in the data’). Some genomes are known to 
have been generated during this time from 
patients treated in the United States’. The 
number is likely to have been much larger: 
thousands of samples were transferred to 
researchers’ freezers across the world. 

In an increasingly connected world, 
rapid sequencing, combined with new 
ways to collect clinical and epidemiologi- 
cal data, could transform our response 
to outbreaks. But the power of these 
potentially massive data sets to combat 
epidemics will be realized only if the data 
are shared as widely and as quickly as pos- 
sible. Currently, no good guidelines exist 
to ensure that this happens. 


SPEED IS EVERYTHING 

Researchers working on outbreaks — from 
Ebola to West Nile virus — must agree on 
standards and practices that promote and 
reward cooperation. If these protocols are 
endorsed internationally, the global research 
community will be able to share crucial 


information immediately wherever and 
whenever an outbreak occurs. 

The rapid dissemination of results 
during outbreaks is sporadic at best. In 
the case of influenza, an international 
consortium of researchers called GISAID 
established a framework for good practice
in 2006. Largely thanks to this, during
the 2009 H1N1 influenza outbreak,
the US National Center for Biotechnology
Information created a public repository
that became a go-to place for the community
to deposit and locate H1N1 sequence
information’. By contrast, the publishing of 
sequence information in the early stages of 
the 2012 Middle East respiratory syndrome 
(MERS) outbreak in Saudi Arabia high- 
lighted uncertainties about intellectual- 
property rights, and the resulting disputes 
hampered subsequent access to samples. 

Sharing data is especially important and 
especially difficult during an outbreak. 
Researchers are racing against the clock. 
Every outbreak can mobilize a different mix- 
ture of people — depending on the microbe 
and location involved — bringing together 
communities with different norms, in wildly 
different places. Uncertainties over whether 
the information belongs to local govern- 
ments or data collectors present further 
barriers to sharing. So, too, does the absence 
of patient consent, common for data col- 
lected in emergencies — especially given the 
vulnerability of patients and their families to 
stigmatization and exploitation during out- 
breaks. Ebola survivors, for instance, risk 
being shunned because of fears that they 
will infect others. 

Fortunately, useful models for respon- 
sible data sharing have been developed 


Pilgrims in Saudi Arabia try to protect themselves from Middle East respiratory syndrome (MERS) virus. 
GAPS IN THE DATA

Genome sequences from the West Africa outbreak of Ebola virus were first made publicly
available in April 2014. Since 99 genomes were released in July, data sets have been shared
sporadically, even though more are known to have been generated. No new sequences were
released from 2 August to 9 November.

(Chart axes: number of publicly released Ebola virus sequences; cases of Ebola virus disease, in thousands.)


by the broader genomics community. In 
1996, at a summit held in Bermuda, the 
heads of the major labs involved in the 
Human Genome Project agreed to submit 
DNA sequence assemblies of 1,000 bases or 
more to GenBank within 24 hours of pro- 
ducing them*®. In exchange, the sequencing 
centres retained the right to be the first to 
publish findings based on their own com- 
plete data sets, by laying out their plans for 
analyses in ‘marker’ papers. 

This rapid release of genomic data 
served the field well. New information on 
30 disease genes, for instance, was pub- 
lished before the release of the complete 
human genome sequence. Since 1996, the 
Bermuda principles have been extended to 
other types of sequence data and to other 
fields that generate large data sets, such as 
metabolite research. 


GUIDELINES FOR SHARING 

More-recent policies on data release simi- 
larly seek to align the interests of differ- 
ent parties, including funding agencies, 
data producers, data users and analysts, 
and scientific publishers. Since January, 
for example, the US National Institutes of 
Health has required grantees to make large- 
scale genomics data public by the time of 
publication at the latest, with earlier dead- 
lines for some kinds of data’. 


SOURCES: SEQUENCES, NCBI/VIROLOGICAL.ORG; EBOLA CASES, WHO 
We urge those at the forefront of outbreak 
research to forge similar agreements, taking 
into account the unique circumstances of 
an outbreak. 

First, incentives and safeguards should 
be created to encourage people to release 
their data quickly into the public domain. 
One possibility is to request that data users 
(and publishers) honour the publication 
intentions of data producers — the ques- 
tions and analyses that they want to pursue 
themselves — for, say, six months. These 
intentions could be broadcast through 
several channels, including citable marker 
papers, disclaimer notices on data reposi- 
tories such as GenBank, and online forums, 
such as virological.org and the EpiFlu data- 
base. Alternatively, data producers could 
publish an announcement about their data 
and their intentions on online forums as a
resource that can be used by others as long 
as they cite the original source. 

Second, ethical, rigorous and standard- 
ized protocols for the collection of samples 
and data from patients should be established 
to facilitate the generation and sharing of 
that information. A global consortium 
involving the leading health and research 
agencies and the ministries of health of 
engaged nations should work together 
towards establishing these. Ethicists should 
be involved to safeguard subjects’ privacy 
and dignity. Biosecurity experts will also 
be needed to address potential dual-use 
research and other safety concerns. A 
helpful analogue is the approach used by 
the Human Heredity and Health in Africa 
(H3Africa) Initiative, which aims to apply 
genomics to improving the health of African 
populations. Since August 2013, H3Africa 
has used standard consent-form guidelines® 
for collecting DNA samples from subjects 
for genomic studies, regardless of their 
country of origin. 

Lastly, any preparation for future out- 
breaks should include provisions for rapidly 
building new bridges and establishing com- 
munity norms. Successful collaborations in 
genomics and historical data-sharing agree- 
ments have tended to involve a fairly stable 
group of individuals and organizations, 
making norms of behaviour relatively easy 
to establish and sustain. By contrast, out- 
breaks can involve a new cast of characters 
each time, and cases in which the pathogen 
is new to science call for whole new fields 
of research. 


THE KENEMA WAY 

As a first step, we call on health agencies such 
as the World Health Organization, the US 
Centers for Disease Control and Prevention
and Médecins Sans Frontières, as well
as genome-sequencing centres and other 
research institutions, to convene a meeting 
this year — similar to that held in Bermuda 


Quarantine officers rush to test passengers at Tokyo’s Narita airport amid the 2009 swine-flu outbreak. 


in 1996. Attendees must include scientists, 
funders, ethicists, biosecurity experts, social 
scientists and journal editors. 

We urge researchers working on outbreaks 
to embrace a culture of openness. For our 
part, we have released all our sequence data 
as soon as it has been generated, includ- 
ing that from several hundred more Ebola 
samples we recently received from Kenema. 
We have listed the research questions that 
we are pursuing at 
virological.org and 
through GenBank, 
and we plan to pre- 
sent our results at 
virological.org as 
we generate them, 
for others to weigh 
in on. We invite 
people either to join our publication, or 
to prepare their own while openly laying 
out their intentions online. We have also 
made clinical data for 100 patients pub- 
licly available and have incorporated these 
into a user-friendly data-visualization tool, 
Mirador, to allow others to explore the data 
and uncover new insights. 


Kenema means ‘translucent, clear like a 
river stream’ or ‘open to the public gaze’.
To honour the memory of our colleagues 
who died at the forefront of the Ebola 
outbreak, and to ensure that no future 
epidemic is as devastating, let’s work 
openly in outbreaks. ■


Nathan L. Yozwiak and Stephen F. 
Schaffner are senior staff scientists, and 
Pardis C. Sabeti is associate professor, at the 
Broad Institute and Harvard University in 
Cambridge, Massachusetts, USA. 

e-mail: nyozwiak@broadinstitute.org 
The revolution is digitized 


Charles Seife digs into three studies of the wild new world of big data. 


The term has been around for almost
two decades, but the world only really
started talking about ‘big data’ in the
first few months of 2011. We know this
because we can look it up on Google Trends 
(see ‘Byte marks’). 

Google built its empire on gathering and 
analysing nearly unfathomable depths of data. 
Every query ever typed into the search engine 
is sitting in Google's multi-exabyte data stores. 
These stores also hold the full text of tens of 
millions of books, high-resolution images of 
streets around the world and myriad e-mails, 
videos, word-processing documents and 
spreadsheets. Anything that can be rendered 
in bits and bytes and is accessible to the com- 
pany’s servers will be pushed, filed, stamped, 
indexed, briefed, debriefed and numbered 
by semi-autonomous information-gathering 
agents. Enter ‘big data’ into the Google Trends 
website and, a fraction of a second later, a 
graph of frequency appears, its line rising 
sharply upwards in the first quarter of 2011. 
You are distilling that information from a 
colossal data set containing the entire world’s 
search-engine queries for the past ten years. 

Seamless upgrades in computer interfaces 
masked a liminal moment: in a few years, we 
have moved from data that can be created, 
gathered and understood by unaided humans 
— kilobytes, megabytes and gigabytes — into 
the hitherto unimaginable realm of petabytes 
and exabytes, gathered at terahertz speeds and 
processed almost as quickly. The transition 
has moved beyond scale to revolution. 

In Big Data, Little Data, No Data, information-studies
specialist Christine Borgman
looks at big data through a fairly narrow lens: 
academic research. Each day, scientists grap- 
ple with ever more appalling volumes of data. 
The ATLAS detector on the Large Hadron 
Collider at CERN, Europe’s particle-physics 
laboratory near Geneva, Switzerland, has 
to sort through dozens of terabytes of data 
every second while it is running — and filter 
that down by five orders of magnitude before 
humans can deal with it. Next-generation 




Big Data, Little Data, No Data: Scholarship 
in the Networked World 

CHRISTINE L. BORGMAN 

MIT Press: 2015. 


Data-ism: Inside the Big Data Revolution 
STEVE LOHR 
Oneworld: 2015. 


Data and Goliath: The Hidden Battles to 
Collect Your Data and Control Your World 
BRUCE SCHNEIER 

W. W. Norton: 2015. 


telescopes such as the Square Kilometre 
Array will be gathering exabytes of data each 
day — an amount that would have filled the 
total storage capacity of all the world’s infor- 
mation-carrying devices (including books, 
photos and videos) up to the mid-1980s. 

Borgman is something of a data anthropol- 
ogist. She goes among researchers in physi- 
cal sciences, social sciences and humanities 
alike to find out how they collect, handle and 
share the flood of information. Her treatise 
is interesting, but frustrating. She has diffi- 
culty turning her sizeable data set into a nar- 
rative both broad enough to cover the range 
of topics and deep enough to do justice to 
them. All too often, she seems to give a quick 
nod to essential elements. For example, she 
mentions open publications and data, but 
provides no hint of the battles around them 
in the research and publishing worlds. She 
offers key insights — that there are different 
dynamics to publishing research results and 
raw data, and that it is shortsighted to focus 
on releasing new data sets rather than on how 
to preserve and reuse the data. But the book 
might have said much more. There is nary 
a word about the huge controversy around 
incursions of commercial entities into the 
gathering, dissemination and control of 
scholarly data. 

In Data-ism, Steve Lohr goes after the com- 
mercial implications of big data, but through 
an equally narrow lens. As a veteran technol- 
ogy and business reporter, he is attracted to 
the story of how data can help to root out 
BYTE MARKS

The surge of interest in big data since 2011 can be clearly
traced in Google's archive of search terms.


inefficiencies that stop businesses reaching 
their potential. He gives the example of 
McKesson, a drug and medical-supply dis- 
tributor that used its archives of product and 
shipping data to create a supply-chain model. 
That led to a billion-dollar decrease in inven- 
tories and a sizeable jump in efficiency, show- 
ing, as Lohr says, “data really being used to... 
make better decisions, ones that trump best 
guesses and gut feel, experience and intuition”. 
Alas, Data-ism is very much a conventional 
business book, full of anecdotes, mini-profiles 
and aphorisms that grow ever less compelling, 
however well they would go over at a TEDx 
talk. Lohr’s journalistic instincts often seem to 
betray him. He is unimpressed with the mas- 
sive data-collecting and consumer-profiling 
of information giant Acxiom, yet bowled over 
by a seemingly conventional personality- 
horoscope program that snaffled up Twitter 
feeds, and, for 81% of subjects, “pretty much 
matched the results of their formal tests for 
personality type, basic values, and needs”. 
Neither Borgman nor Lohr truly grapples 
with the immensity of the big-data story. At 
its core, big data is not primarily a business 
or research revolution, but a social one. In 
the past decade, we have allowed machines 
to act as intermediaries in almost every 
aspect of our existence. When we com- 
municate with friends, entertain ourselves, 
drive, exercise, go to the doctor, read a book 
— a computer transmitting data is there. We
leave behind a vast cloud of bits and bytes. 
Bruce Schneier, a security analyst known 
for designing the Blowfish block-cipher 
algorithm — a fast and flexible method of 
encrypting data — grasps this revolution’s 
true dimensions. In Data and Goliath, he 
describes how our relationships with govern- 
ment, corporations and each other are trans- 
formed by ordinary, once-ephemeral human 
interactions being stored in digital media. The 
seemingly meaningless, incidental bits of data 
that we shed are turning the concept of pri- 
vacy into an archaism, despite half-hearted 
(and doomed) regulations to protect “personally
identifiable information”. As science-fiction
pioneer Isaac Asimov wrote some
30 years ago: “Things just seem secret because 
people don’t remember. If you can recall every 
remark, every comment, every stray word 
made to you or in your hearing and consider 
them all in combination, you find that every- 
one gives himself away in everything.” 
Schneier paints a picture of the big-data 
revolution that is dark, but compelling; one in 
which the conveniences of our digitized world 
have devalued privacy. Interest in privacy has 
dropped by 50% over the past decade — at 
least according to Google Trends. ■


Charles Seife is a professor of journalism at 
New York University, and the author of books 
including Proofiness and Virtual Unreality. 
e-mail: cs129@nyu.edu 
Books in brief 


Women After All: Sex, Evolution, and the End of Male Supremacy 
Melvin Konner W. W. NORTON (2015)

The mammalian body plan is basically female, and maleness is
a syndrome. So declares anthropologist Melvin Konner in this
biologically based study (although recent research points to 
complexities; see Nature 518, 288-291; 2015). Positing that women 
are more altruistic and pragmatic — and so are best-equipped for 
the future — Konner mines evolution and anthropology to probe 
gender identities in the light of biology, sexual conflict across species 
and more. The provocative scenarios he lays out include a man-free 
world where women reproduce using DNA from other women’s eggs. 


The Last Unicorn: A Search for One of Earth’s Rarest Creatures 
William deBuys LITTLE, BROWN (2015) 

Discovered in 1992, the saola (Pseudoryx nghetinhensis) is one of the 
rarest large mammals, a beautiful ruminant found in the mountains 
between Laos and Vietnam. In 2011, nature writer William deBuys 
and field biologist William Robichaud set out to gauge poaching 
pressures on the saola. DeBuys’ account of destitute villages and 
endangered animals left to die in snares is a familiar narrative of 
conservation in poor countries. But, like Peter Matthiessen’s 1978 
The Snow Leopard (Viking), this is less an homage to an iconic species 
than a meditation on our compulsion to harry and hem in the wild. 


The Powerhouse: Inside the Invention of a Battery to Save the World 
Steve LeVine VIKING (2015) 

Journalist Steve LeVine’s chronicle of the race to develop a 
rechargeable lithium-ion electric-car battery makes for a propulsive 
techno-saga. The action centres on the Argonne National Laboratory 
outside Chicago, Illinois, where an international group led by 
engineer Jeff Chamberlain worked on the knotty physics. LeVine 
interweaves the geopolitical jostling of the US lab and others in Asia, 
climaxing with Argonne’s 2012 win of more than US$120 million
to build the ‘Hub’ — a powerhouse intended to create a sustainable
battery industry. 


Energy Revolution: The Physics and the Promise of Efficient 
Technology 

Mara Prentiss HARVARD UNIVERSITY PRESS (2015) 

In this crisp, evidence-based treatise, physicist Mara Prentiss makes a 
remarkable assertion: that solar and wind power could supply 100% 
of average US energy needs for the next 50 years. Prentiss argues that 
a transition to renewables is probable, given that energy revolutions 
are a historical norm. She stacks up reams of salient data, such as the 
fact that US energy use per capita has remained steady since 1965, 
thanks to increasing fuel efficiency. Although optimistic, her analyses 
of energy sources, combinations, conservation and storage compel. 


House Guests, House Pests: A Natural History of Animals in the Home 


Richard Jones BLOOMSBURY (2015)
Urban nature lovers relish the sight of birds or hedgehogs in their
gardens. “Something odd, though, happens at the back door,”
notes Richard Jones — and that is zero tolerance for wild unbidden
guests, from tapestry moths to rats. Jones, a fellow of the Royal


Entomological Society, is a learned guide to this alarming panoply of 
intruders, from the bacon beetle (Dermestes lardarius), a vagrant of
old-fashioned larders, to the noisy edible dormouse (Glis glis), which 
can infest the attics of rural houses. Barbara Kiser 
Correspondence 


Gear students up for 
big medical data 


Combining big data with 
personalized medicine is an 
unprecedented opportunity. It 
will probably be cheaper than 
current practices in the long 
term, particularly given the 
questionable effectiveness of 
many medications (see Nature 
517, 540; 2015). 

Success in this endeavour will 
depend on training the next 
generation of clinicians and data 
scientists to deploy terabytes of 
data to select from a range of 
diagnosis and treatment options. 

Undergraduate and graduate 
bioinformatics programmes need 
to embrace data-analytics courses 
geared towards generating a new 
type of medical specialist — one 
who no longer needs to see 
patients, just their data. 

Ervin Sejdić University of
Pittsburgh, Pennsylvania, USA. 
esejdic@ieee.org 


Antibodies: the 
solution is validation 


I disagree with Andrew 
Bradbury and colleagues’ 
suggestion that making the 
sequences of commercial 
antibodies publicly available 
could minimize irreproducibility 
in biomedical research (Nature 
518, 27-29; 2015). The real 
solution is proper initial 
validation of antibodies. 

In my view, the reproducibility 
problem is better addressed by 
identifying the good antibodies 
and the reputable companies 
that develop, validate and 
manufacture them — as astute 
scientists do now. Also, journals 
need to mandate the provision 
of detailed validation data, 
protocols and antibody sources 
(clone, catalogue number). 
Independent websites enabling 
the submission of antibody data 
and consumer feedback would 
also help. 

The biggest investment in 
developing a good monoclonal 
antibody is the extensive work 


needed to validate specificity 
and sensitivity across all relevant 
applications. Unlike therapeutic 
antibodies, most research 
antibodies are not sequence- 
patented because the cost is too 
high to be recovered by sales. 
Even if the practical hurdles 
of funding and enforcing a 
sequence-publishing policy 
could be overcome, making 
unpatented antibody sequences 
public would allow them to be 
widely copied, produced and 
sold. This would eliminate the 
incentive for good companies to 
invest in validation. It would also 
allow ‘bad’ antibody sequences 
to contaminate the databases. 
The authors’ proposal could 
therefore disproportionately 
harm the good companies, hurt 
the end-users it is designed to 
protect, and would not solve the 
reproducibility problem. 
Roberto D. Polakiewicz Cell 
Signaling Technology, Danvers, 
Massachusetts, USA. 
rpolakiewicz@cellsignal.com 


Antibodies: validate 
recombinants too 


Recombinant antibodies are pure 
proteins with minimal batch- 
to-batch variability, so could 
provide an important element 
of antibody standardization 
(A. Bradbury et al. Nature 
518, 27-29; 2015). However, 
they must still be functionally 
validated if they are to help solve 
the reproducibility crisis. 
Vendors and researchers 
would have to optimize their 
recombinant antibodies for 
specific applications, because 
of the inherent complexity of 
these molecules and their ability 
to bind non-specifically to 
other proteins carrying similar 
immunological sequences. 
Suppliers and users of such 
antibodies will need specialized 
training in this validation and 
optimization, particularly in 
experimental design and the 
extensive use of controls. 
Leonard P. Freedman Global 
Biological Standards Institute, 


Washington DC, USA. 
lfreedman@gbsi.org


Polluters migrate to 
China’s poor areas 


Bo Zhang and Cong Cao argue 
that China's citizens should have a 
legal right to safeguard the quality 
of their environment (Nature 
517, 433-434; 2015). The wealthy 
would stand to benefit most from 
such a public litigation system, 
causing pollution producers to 
migrate to poorer areas. 

Heavy industry in China is 
already moving out of developed 
eastern regions to the west (see 
X. Bai et al. Nature 509, 158-160; 
2014), where it is damaging the 
local ecology. Industrial waste slag 
has eroded a nature conservation 
area in Xinjiang (see go.nature. 
com/68e1lo; in Chinese), for 
example, and discharge from 
factories has severely polluted 
part of the Tengger Desert at the 
border of the Inner Mongolia and 
Ningxia regions (see go.nature. 
com/nlfbrs; in Chinese). 

Environmental activism by 
residents in affluent areas such as 
Shenzhen, Jinan and Beijing (see, 
for example, Q. Wang Nature 
497, 159; 2013) is accelerating 
this migration of polluters into 
poor areas where environmental 
protection is considered a luxury, 
and where water and soil are 
already badly contaminated. 

Xin Miao Harbin Institute of 
Technology, Harbin, China. 
Yanhong Tang Northeast 
Agricultural University, Harbin, 
China. 

Christina W. Y. Wong The Hong 
Kong Polytechnic University, 
Kowloon, Hong Kong. 
xin.miao@aliyun.com 


Biochar: bring on 
the sewage 


Biochars are carbon-rich
soil additives derived from
agricultural and other plant 
waste that could enhance crop 
productivity (see Nature 517, 
258-260; 2015). We suggest that 


biochars could also be produced 
from human sewage — an 
underutilized resource that is 
rich in soil nutrients and carbon. 
Sanitation problems in 
developing regions would be 
alleviated by diverting sewage 
solids into producing biochar, 
made by thermal conversion in 
sealed containers. This might 
even offset the need to install 
conventional sewage-treatment 
infrastructure with its higher 
construction and operation costs. 
Marc Breulmann, Manfred 
van Afferden, Christoph 
Fühner Helmholtz Centre for
Environmental Research — UFZ, 
Leipzig, Germany. 
christoph.fuehner@ufz.de 


Biochar: pros must 
outweigh cons 


To optimize the agricultural
and environmental benefits of
biochar, a charcoal-rich soil 
additive, we need to overcome 
its potentially undesirable effects 
(see Nature 517, 258-260; 2015). 

For example, it is uncertain 
whether biochar — effectively 
an underground carbon store 
— can help to mitigate carbon
emissions. A ten-year study of 
boreal forests found that applying 
biochar led to soil degradation 
and increased the activity of soil 
microbes, causing carbon dioxide 
release (D. A. Wardle et al. 
Science 320, 629; 2008). 

Adding blackened biochar can 
also lower the reflectivity (albedo) 
of the soil surface, potentially 
exacerbating climate warming 
(S. Meyer et al. Environ. Sci. 
Technol. 46, 12726-12734; 2012). 

Tilling deep furrows in the soil 
would help to reduce the decline 
in reflectivity and increase the 
efficiency of applied biochar. 
However, this practice could also 
encourage carbon dioxide release. 
Hong Yang University of Oslo, 
Norway. 

Xianjin Huang Nanjing 
University, China. 

Julian R. Thompson University 
College London, UK. 
hongyanghy@gmail.com 
OBITUARY 


Robert A. Berner 


(1935-2015) 


Geochemist who quantified the carbon cycle. 


From how minerals form in
sediments to how carbon
dioxide is regulated in the
atmosphere, Robert Arbuckle Berner 
quantified elemental cycles across the 
Earth system. He developed the first 
whole-Earth mathematical model 
of CO₂ exchange, which revealed
marked changes in our planet’s past
atmospheric levels and the rates
at which natural processes might
remove anthropogenic CO₂ from the
atmosphere. 

Born in 1935 in Erie, Pennsyl- 
vania, Berner died on 10 January in 
New Haven, Connecticut. He was 
encouraged to develop an interest 
in geology by his older brother Paul, 
a (now-retired) petroleum geolo- 
gist. Berner attended the University 
of Michigan in Ann Arbor for his 
undergraduate and master’s degrees. 
There, he spotted fellow geology 
student Betty Kay. They married 
in 1959 and formed an inseparable 
bond, working and writing papers and 
books together for decades. 

Berner received his PhD in 1962 from 
Harvard University in Cambridge, Mas- 
sachusetts. During his thesis work on the 
formation of iron sulfides in sediments, 
he discovered new minerals, among them 
greigite, and invented a type of electrode 
used for measuring sulfide content. His 
adviser was Raymond Siever, known for 
his work on the ancient marine silicon 
cycle, and Berner was also heavily influ- 
enced by Bob Garrels, who championed 
thermodynamics and the concept of 
geochemical cycles. 

He moved as a postdoc to the Scripps 
Institution of Oceanography in La Jolla, 
California. After a short stay as assistant 
professor at the University of Chicago, 
Illinois, in 1965 he joined the faculty at 
Yale University in New Haven, where he 
remained until his retirement in 2006. 

Soon after arriving at Yale, Berner real- 
ized that mineral formation in sediments 
depends on how fast chemicals are trans- 
ported in and out of the sediments and 
how quickly organic matter is oxidized 
by microbes. He developed mathemati- 
cal expressions for these mechanisms and 
so started the field of sediment diagenesis, 
which concerns the biological and chemical
processes that occur in recently formed
sediments. Berner and others went on to
establish how sediment processes ultimately
control the nutrient balance of the oceans 
and the concentrations of oxygen and CO₂
in the atmosphere. 

In the early 1980s, Berner teamed 
up with Garrels and Antonio Lasaga to 
develop the BLAG (Berner, Lasaga and 
Garrels) model of global atmospheric 
CO₂ concentrations over geological time.
This was the first global model aimed at
quantifying all conceivable processes that
control CO₂ exchange, and was largely
based on Berner’s earlier work on mineral- 
weathering reactions, ocean chemistry and 
early diagenesis. The BLAG model allowed 
geologists to understand for the first time 
how changes in rates of geological processes 
such as continental plate motion, for example,
controlled past CO₂ levels.

Components since added to the model 
include the influence of biological evolution 
on the history of CO₂ concentrations, which
hint at relationships between plant evolu- 
tion and glaciation. These later models also 
reproduce a history of atmospheric oxygen, 
and show, for instance, how past periods of 
elevated oxygen concentrations correlate 
with spells of insect gigantism. 

Bob focused like a laser beam on the 
problem at hand and was able to find 
simple and elegant solutions to complex 
geological problems. He exuded
warmth and humanity. In my time 
at Yale as his PhD student, from 
1982 to 1988, he often joined us 
for Friday happy hour at the local 
Whitney Winery, entertaining us 
with stories of science’s colourful 
characters. 

He was a Francophile and loved 
the winery’s outside terrace because 
it reminded him of Parisian cafés. It 
was more expensive to sit there so 
Bob inevitably picked up the bill, 
and sometimes invited us home 
afterwards for a meal and to sample 
his (not so fine) wines and whis- 
key. Fine wine was saved for the 
celebration of new PhDs. Bob and 
Betty invited the newly minted PhD 
to their home for a luxury dinner. 
Afterwards, the entire lab would 
descend, often for a long night of 
ping-pong and poker. 

In material things, Bob had simple 
tastes. When he and Betty inherited 
a powder-blue Chevy Nova, Bob gleefully 
announced that it was his first car with a 
radio, and invited a group of graduate stu- 
dents to ride with him through New Haven 
listening to old-time radio stations. Bob 
later purchased a Honda Civic, which he 
named Harvey, that had a tape deck and 
door alarm. He made a 4-track tape record- 
ing of piano and percussion parts, which, 
when played in the car with front doors ajar, 
mixed perfectly with the car alarm. Bob 
had a party to celebrate, packing graduate 
students into the car to enjoy his composition,
“Harvey and the four Bobs”.

He was, however, a serious pianist and 
classical composer, devoting much time to 
music after his retirement (see go.nature. 
com/bnq4a2). 

Bob was a loving family man and a dedi- 
cated friend and mentor. He stressed honesty 
and integrity while showing that science was 
great fun. Bob touched the whole geological 
community. He is sorely missed.


Don Canfield is professor of ecology and 
director of the Nordic Center for Earth 
Evolution (NordCEE) at the University of 
Southern Denmark in Odense, Denmark. 
He was a PhD student of Bob Berner’s from 
1982 to 1988 at Yale University in New 
Haven, Connecticut. 

e-mail: dec@biology.sdu.dk 
ARTIFICIAL INTELLIGENCE


Learning to see and act 


An artificial-intelligence system uses machine learning from massive training sets to teach itself to play 49 classic computer 
games, demonstrating that it can adapt to a variety of tasks. SEE LETTER P.529 


BERNHARD SCHÖLKOPF


Improvements in our ability to process large amounts of data have led to
progress in many areas of science, not least artificial intelligence (AI).
With advances in machine learning has come the development of machines that
can learn intelligent behaviour directly from data, rather than being
explicitly programmed to exhibit such behaviour. For instance, the advent of
‘big data’ has resulted in systems that can recognize objects or sounds with
considerable precision. On page 529 of this issue, Mnih et al.¹ describe an
agent that uses large data sets to teach itself how to play 49 classic Atari
2600 computer games by looking at the pixels and learning actions that
increase the game score. It beat a professional games player in many
instances, a remarkable example of the progress being made in AI.

In machine learning, systems are trained to infer patterns from
observational data. A particularly simple type of pattern, a mapping between
input and output, can be learnt through a process called supervised
learning. A supervised-learning system is given training data consisting of
example inputs and the corresponding outputs, and comes up with a
model to explain those data (a process called function approximation). It
does this by choosing from a class of model specified by the system's
designer. Designing this class is an art: its size and complexity should
reflect the amount of training data available, and its content should
reflect ‘prior knowledge’ that the designer of the system considers useful
for the problem at hand. If all this is done well, the inferred model will
then apply not only for the training set, but also for other data that
adhere to the same underlying pattern.

The rapid growth of data sets means that machine learning can now use
complex model classes and tackle highly non-trivial inference problems. Such
problems are usually characterized by several factors: the data are
multidimensional; the underlying pattern is complex (for instance, it might
be nonlinear or changeable); and the designer has only weak prior knowledge
about the problem; in particular, a mechanistic understanding is lacking.

The human brain repeatedly solves non-trivial inference problems as we go
about our daily lives, interpreting high-dimensional sensory data to
determine how best to control all the muscles of the body. Simple supervised
learning is clearly not the whole story, because we often learn without a
‘supervisor’ telling us the outputs of a hypothetical input-output function.
Here, ‘reinforcement’ has a central role in learning behaviours from weaker
supervision. Machine learning adopted this idea to develop
reinforcement-learning algorithms, in which supervision takes the form of a
numerical reward signal², and the goal is for the system to learn a policy
that, given the current state, determines which action to pick to maximize
an accumulated future reward.

Mnih et al. use a form of reinforcement learning known as Q-learning³ to
teach systems to play a set of 49 vintage video games, learning how to
increase the game score as a numerical reward. In Q-learning, Q*(s,a)
represents the accumulated future reward, Q*, if in state s the system first
performs action a, and subsequently follows an optimal policy. The system
tries to approximate Q* by using an artificial neural network (a function
approximator loosely inspired by biological neural networks) called a deep
Q-network (DQN). The DQN’s input (the pixels from four consecutive game
screens) is processed by connected ‘hidden’ layers of computations, which
extract more and more specialized visual features to help approximate the
complex nonlinear mapping between inputs and the value of possible actions:
for instance, the value of a move in each possible direction when playing
Space Invaders (Fig. 1).

Figure 1 | Computer gamer. Mnih et al.¹ have designed an
artificial-intelligence system, using a ‘deep Q-network’ (DQN), that learns
how to play 49 video games. The DQN analyses a sequence of four game screens
simultaneously and approximates, for each possible action it can make, the
consequences on the future game score if that action is taken and followed
by the best possible course of subsequent actions. The first layers of the
DQN analyse the pixels of the game screen and extract information from more
and more specialized visual features (image convolutions). Subsequent, fully
connected hidden layers predict the value of actions from these features.
The last layer is the output: the action taken by the DQN. The possible
outputs depend on the specific game the system is playing; everything else
is the same in each of the 49 games.
[Figure labels: Image convolutions; Hidden layers; Game controller action
values; Output]

486 | NATURE | VOL 518 | 26 FEBRUARY 2015

© 2015 Macmillan Publishers Limited. All rights reserved

The system picks output actions on the basis 
of its current estimate of Q*, thereby exploit- 
ing its knowledge of a game's reward structure, 
and intersperses the predicted best action with 
random actions to explore uncharted territory. 
The game then responds with the next game 
screen and a reward signal equal to the change 
in the game score. Periodically, the network 
uses inputs and rewards to update the DQN 
parameters, attempting to move closer to Q*. 
Much thought went into how exactly to do this, 
given that the agent collects its own training 
data over time. As such, the data are not inde- 
pendent from a statistical point of view, imply- 
ing that most of statistical theory does not 
apply. The authors store past experiences in the 
system’s memory and subsequently re-train on 
them — a procedure they liken to hippocampal 
processes during sleep. They also report that 
the system benefits from randomly permuting 
these experiences. 
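
The loop described above (act on the current estimate of Q*, mix in random actions, store each experience and re-train on a random sample of past ones) can be sketched in a few lines. This is only an illustration under loud assumptions: the 'game' is a hypothetical six-cell corridor rather than an Atari emulator, a small lookup table stands in for the deep network, and the constants (`GAMMA`, `ALPHA`, `EPSILON`, `MEMORY_SIZE`) are arbitrary choices, not values reported by Mnih et al.

```python
import random

# Hypothetical toy game (an assumption, not an Atari emulator): a corridor of
# N_STATES cells; stepping onto the right-hand end scores 1 point and ends the game.
N_STATES = 6
ACTIONS = (0, 1)    # 0 = step left, 1 = step right
GAMMA = 0.9         # discount applied to future reward (illustrative value)
ALPHA = 0.5         # learning rate (illustrative value)
EPSILON = 0.2       # probability of a random, exploratory action
MEMORY_SIZE = 1000  # how many past experiences are kept


def step(state, action):
    """Return (next_state, reward, done) for the toy corridor."""
    nxt = max(0, state - 1) if action == 0 else state + 1
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done


def train(episodes=300, batch_size=16, seed=0):
    rng = random.Random(seed)
    # A table replaces the deep network: q[s][a] is the estimate of Q*(s, a).
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    memory = []  # stored experiences (s, a, reward, s_next, done)
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Exploit the current estimate, interspersed with random actions.
            if rng.random() < EPSILON:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            nxt, reward, done = step(state, action)
            memory.append((state, action, reward, nxt, done))
            if len(memory) > MEMORY_SIZE:
                memory.pop(0)
            # Re-train on a randomly chosen batch of stored experiences.
            for s, a, r, s2, d in rng.sample(memory, min(batch_size, len(memory))):
                target = r if d else r + GAMMA * max(q[s2])
                q[s][a] += ALPHA * (target - q[s][a])
            state = nxt
    return q


if __name__ == "__main__":
    for s, (left, right) in enumerate(train()[:-1]):
        print(f"state {s}: Q(left)={left:.3f}  Q(right)={right:.3f}")
```

In this toy setting the learnt values favour stepping right in every cell, because only that direction leads to the reward. The real system faces the same update rule but must generalize across screens it has never seen, which is where the deep network comes in.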

There are several interesting aspects of 
Mnih and colleagues’ paper. First, the system 
performances are comparable to those of a 
human games tester. Second, the approach 
displays impressive adaptability. Although 
each system was trained using data from one 
game, the prior knowledge that went into the 
system design was essentially the same for all 
49 games; the systems essentially differed only 
in the data they had been trained on. Finally, 
the main methods used have been around for 
several decades, making Mnih and colleagues’ 
engineering feat all the more commendable. 

What is responsible for the impressive per- 
formance of Mnih and colleagues’ system, also 
reported for another DQN’*? It may be largely 
down to improved function approximation 
using deep networks. Even though the size 
of the game screens produced by the emulator
is reduced by the system to 84 × 84 pixels,
the problem's dimensionality is much higher 
than that of most previous applications of rein- 
forcement learning. Also, Q* is highly nonlin- 
ear, which calls for a rich nonlinear function 
class to be used as an approximator. This type 
of approximation can be accurately made only 
using huge data sets (which the game emulator 
can produce), state-of-the-art function learn- 
ing and considerable computing power. 

Some fundamental issues remain open, 
however. Can we mathematically understand 
reinforcement learning from dependent data, 
and develop algorithms that provably work? 
Is it sufficient to learn statistical associa- 
tions, or do we need to take into account the 
underlying causal structure, describing, say, 
which pixels causally influence others? This 
may help in finding relevant parts of the state 
space (for example, identifying which sets of 
pixels form a relevant entity, such as an alien 
in Space Invaders); in avoiding ‘superstitious’
behaviour, in which statistical associations
may be misinterpreted as causal; and in 
making systems more robust with respect to 
data-set shifts, such as changes in the behav- 
iours or visual appearance of game charac- 
ters*’*. And how should we handle latent 
learning — the fact that biological systems also 
learn when no rewards are present? Could this 
help us to handle cases in which the dimen- 
sionality is even higher and the key quantities 
are hidden in a sea of irrelevant information? 

In the early days of AI, beating a professional 
chess player was held by some to be the gold 
standard. This has now been achieved, and the 
target has shifted as we have grown to under- 
stand that other problems are much harder 
for computers, in particular problems involv- 
ing high dimensionalities and noisy inputs. 
These are real-world problems, at which biological
perception–action systems excel and
machine learning outperforms conventional
engineering methods. Mnih and colleagues
may have chosen the right tools for this job,
and a set of video games may be a better model
of the real world than chess, at least as far as AI
is concerned.
The benefits of
traditional knowledge 


A study of two Balkan ethnic groups living in close proximity finds that 
traditional knowledge about local plant resources helps communities to cope 
with periods of famine, and can promote the conservation of biodiversity. 


MANUEL PARDO-DE-SANTAYANA 
& MANUEL J. MACÍA


Understanding how human groups
obtain, manage and perceive their 
local resources — particularly the 
plants they use as food and medicine — is cru- 
cial for ensuring that those communities can 
continue to live and benefit from their local 
ecosystems in a sustainable way. The study 
of these complex interactions between plants 
and people is the aim of an integrative disci- 
pline known as ethnobotany, which is based 
on methods derived mainly from botany and 
anthropology’. Most ethnobotanical research 
reveals that traditional knowledge about local 
edible and healing resources is suffering an 
alarming decline’, especially in Europe’. 
However, writing in Nature Plants, Quave 
and Pieroni* suggest that wild plants still have 
an essential role for communities living in 
the mountains of Kukës, one of the poorest
districts of Albania. Their results also show 
how preserving local knowledge is linked to 
maintaining biodiversity. 
The mountains of Kukës lie in the Balkans,
a hotspot of cultural and biological diversity 
that has suffered major political and eco- 
nomic shifts over the past three decades. 
Quave and Pieroni studied two culturally
and linguistically distinct rural Islamic ethnic
groups (the Gorani and Albanians) that, 
despite living in close proximity in this region 
and facing similar environmental and eco- 
nomic conditions, have remained relatively 
isolated from one another. The two groups use 
wild plants in different ways, giving the authors 
an opportunity to investigate the role of cul- 
tural factors in shaping how the local flora is 
understood and used in daily life, health prac- 
tices and, ultimately, survival. Among the vari- 
ous quantitative techniques used, the authors 
designed a simple but innovative tool to com- 
pare the cultural similarities and differences 
between the two groups’ use of plant species. 
The researchers report significant variation 
in the plant species used for medicinal pur- 
poses by the two ethnic groups. A plausible 
explanation for this is that the spread of health- 
related lore requires a high degree of affinity, 
because trying a new remedy requires a great 
deal of trust’. Health is a sensitive topic, so 
people accept advice mainly from knowledge- 
able relatives or friends belonging to the same 
ethnic group®. Moreover, many traditional 
remedies have a highly symbolic component, 
and the mechanisms by which they are believed 
to bring about healing can lie — totally or 
partially — in the remedy’s cultural meaning’. 
Quave and Pieroni find only two species,
Urtica dioica and Rosa canina (Fig. 1), that
were widely used by both ethnic groups, and
both species are edible. Generally, there was
much more convergence in the food plants
used by the two groups. The researchers suggest
that this can be explained by the importance
of wild edible species in ensuring food
security. The robust local lore concerning
these plants serves as a reservoir of knowledge,
preparing the groups to cope with periods of
famine or the scarcity of staple foods⁸. When
food is scarce, cultural boundaries seem to be
more permeable, because the survival of the
group is at stake.

Figure 1 | The dog rose (Rosa canina) is used by both Gorani and Albanian ethnic groups.

Another issue for consideration is the fact 
that some species are used medicinally by one 
group, but sold to plant traders by the other. 
People assign higher values to species that they 
use in their daily lives than to those that are 
harvested for marketing, which, as the authors 
point out, can have a major effect on the con- 
servation of these resources. The group’s 
relationship with the resource is much more 
intimate in the former case. Indeed, many 
regularly used plant species are of great cul- 
tural significance and have a prominent place 
in the local collective memory. They are part of 
local histories and narratives — they represent 
the essence, personality and identity of their 
community. 

This study demonstrates that cultural values 
have a major effect on traditional local knowl- 
edge. Sustainable exploitation of local bio- 
diversity is much more likely for resources 
that are emotionally valued than for those 
that are used in an impersonal way, as a source 
of income. A report published last year’ pos- 
ited that many indigenous communities that 
have successfully conserved biodiversity in 
their locality do so by combining an extensive 
and experiential knowledge with an intensely 
respectful emotional engagement with nature. 
Furthermore, the report suggests that our
inclination to conserve biodiversity is a function
of the number and intensity of our emotional 
attachments. Therefore, if traditional local 
knowledge is forgotten, biodiversity is also in 
danger of being lost, as is happening in some 
sacred forests and habitats that are in the pro- 
cess of being transformed and degraded"”. 
Studies such as Quave and Pieroni’s can help 
to integrate traditional local knowledge with 
efforts to conserve biocultural diversity. By 
focusing on the point of view of people who 
are or have been deeply dependent on their 
local resources, these studies can promote cul- 
turally appropriate, sustainable development 
strategies. Unfortunately, this integration has 
received little support, and the implications of 
integration are yet to be properly evaluated. 
Future ethnobotanical studies should build 
on the innovative approach taken by Quave
and Pieroni when testing the role of cultural
factors in the distribution and preservation
of traditional local knowledge, by comparing
the authors’ results with larger data sets from
informants and communities gathered in other
regions worldwide.

Finally, the authors have demonstrated that
quantitative techniques for analysing ethnobotanical
data can lead to a deeper level of
understanding within the discipline. We suggest
that quantitative techniques should thus
be further explored. Ethnobotany has a key
part to play in studies of how ethnic groups can
benefit from and coexist with their ecosystems.
Policy and decision makers should take into
account the views and traditions of local communities,
particularly in rural regions with
economic and social instability.

STEM CELLS

Chasing blood


Many experiments have probed the mechanisms by which transplanted stem 
cells give rise to all the cell types of the blood, but it emerges that the process is 
different in unperturbed conditions. SEE LETTER P.542 


SIDHARTHA GOYAL & PETER W. ZANDSTRA 


Blood is one of the most dynamic tissues
in the human body, with millions of
cells being produced each second. But
how blood-cell production occurs under 
unperturbed conditions, and which stem and 
progenitor cell types are responsible for stable 
maintenance of the blood, has been unclear. In 
this issue, Busch et al.’ (page 542) use a genetic 
labelling strategy to gain insight into blood-cell 
production under normal conditions. They
reveal that an unexpected subset of blood stem 
cells is the major player in day-to-day blood- 
cell production. 

Our understanding of haematopoiesis 
(the process through which all the cells of the 
blood are generated) is built on experiments 
in which blood cells are transplanted into 
recipients whose bone marrow (the source 
of blood-cell production) has been depleted 
through a process called myeloablation. The
cells’ progeny are then tracked over time². Such
work has helped to define key molecular and 
functional properties of blood stem cells, and 
to segregate the cells into different subsets, or 
compartments, on the basis of the time over 
which they contribute to haematopoiesis and 
on the lineages of cells that they can gener- 
ate. At the top of the hierarchy are long-term 
haematopoietic stem cells (LT-HSCs), which 
give rise to short-term haematopoietic stem 
cells (ST-HSCs), both of which are named 
in accordance with the time over which they 
contribute to post-transplantation haemato- 
poiesis. Both LT-HSCs and ST-HSCs exhibit 
self-renewal potential and typically give rise 
to all blood-cell lineages. 

Only recently have we had the tools to 
extend our understanding of post-trans- 
plantation haematopoiesis to unperturbed 
conditions. Busch and colleagues’ technique 
for studying unperturbed haematopoiesis is 
the cellular equivalent of classic ‘pulse–chase’
experiments, in which intracellular molecular 
components are traced by transient labelling 
of molecules to discover biosynthetic path- 
ways. In the current study, the authors engi- 
neered mice such that LT-HSCs expressing the 
gene Tie2 could be genetically labelled with a
fluorescent protein. Those cells and all their
descendants will then fluoresce regardless of 
whether or not they express Tie2, allowing 
tracing of the cell lineages arising from the 
labelled stem cells. Following labelling, the 
authors performed a battery of assays at dif- 
ferent times, and used mathematical model- 
ling to analyse the resulting data (which were 
supported by in vivo stem-cell-transplantation 
experiments). This enabled them to define the 
rates of transition between blood-cell com- 
partments, the length of time that cells spent in 
each compartment and the relative cell num- 
bers in the different compartments (Fig. 1). 
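
To see how a transition rate might, in principle, be read off such labelling data, consider a minimal sketch. It assumes (this is an illustrative simplification, not the authors' model) that both compartments are at steady state and that the labelled fraction downstream, f_ST, relaxes towards the labelled fraction upstream, f_LT, at a rate k equal to the daily influx of upstream cells per downstream cell: df_ST/dt = k (f_LT − f_ST). The function name and both numerical values are hypothetical.

```python
import math


def labelled_fraction_st(f_lt, k_per_day, days, dt=0.01):
    """Euler integration of df_ST/dt = k * (f_LT - f_ST), starting from f_ST = 0."""
    f_st = 0.0
    for _ in range(int(round(days / dt))):
        f_st += dt * k_per_day * (f_lt - f_st)
    return f_st


F_LT = 0.01  # hypothetical: 1% of upstream stem cells carry the label
K = 0.02     # hypothetical: upstream cells entering per downstream cell per day

if __name__ == "__main__":
    for day in (30, 100, 300):
        numeric = labelled_fraction_st(F_LT, K, day)
        exact = F_LT * (1.0 - math.exp(-K * day))  # closed-form solution
        print(f"day {day}: simulated {numeric:.5f}, exact {exact:.5f}")
```

Under these assumptions, a slow rise of the label in the downstream compartment corresponds to a small k, that is, to a small flux out of the upstream compartment relative to the size of the downstream one. Fitting curves of this kind at several time points is one way such rates could be estimated.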

One of Busch and co-workers’ central 
findings is that large numbers of ST-HSCs are 
responsible for most blood-cell production 
throughout the lifetime of the animal, with 
LT-HSCs participating to only a limited extent. 
This is perhaps not surprising, given that 
ST-HSCs proliferate faster than LT-HSCs. The 
authors also infer that at least 30% of LT-HSCs 
(around 5,000 cells in an adult mouse) go on to 
give rise to differentiated blood-cell lineages. 
This suggests that, in unperturbed condi- 
tions, the composition of the blood is highly 
polyclonal — that is, it is derived from many 
different stem cells. 
Figure 1 | A balance in the blood. During blood-cell production (haematopoiesis), long-term
haematopoietic stem cells (LT-HSCs) give rise to short-term HSCs (ST-HSCs), which eventually produce
the cells of the peripheral blood system. a, Busch et al.¹ investigated the numbers of each cell type over time.
Numbers in each compartment rise during embryonic development to reach steady-state levels, which
are maintained throughout adulthood. The death of peripheral blood cells owing to the introduction of a
cytotoxic agent does not affect the numbers in either stem-cell compartment. If LT-HSCs and ST-HSCs are
destroyed by myeloablation (depletion of bone-marrow cells), the numbers of peripheral blood cells dip,
but quickly return to normal. LT-HSCs are derived from many clones under normal conditions, but fewer
clones typically contribute to post-transplantation haematopoiesis. b, The authors also investigated the
rate at which LT-HSCs transition to ST-HSCs, and at which ST-HSCs transition to peripheral blood cells
in unperturbed and post-transplantation conditions (development was not studied). The rate of transition
out of the ST-HSC compartment is consistently higher than the rate of transition from the LT-HSC
compartment, implying that ST-HSCs are the major players in unperturbed haematopoiesis.




The high level of polyclonality inferred by 
these experiments is in contrast to the findings 
of many transplant studies, in which a handful 
of individual LT-HSCs typically repopulate the 
entire blood system. However, to some extent, 
this dogma may be associated with the fact 
that often only a few transplanted cells suc- 
cessfully engraft in the bone marrow under 
normal experimental conditions. Busch et al. 
provide evidence suggesting that, unlike the 
case in unperturbed haematopoiesis, LT-HSCs 
are the more important contributors to the 
haematopoiesis that occurs during embryonic 
development or after treatment with a cytotoxic 
agent that kills blood cells. This finding suggests 
that feedback signalling from mature blood and 
progenitor cells’ to the LT-HSC compartment 
is key to controlling transition rates between 
compartments. 

A paper published last year* described the 
use of another genetic technique to track the 
contribution of a few thousand blood-stem- 
cell clones (each tagged with a different molec- 
ular signature) to the peripheral blood under 
normal conditions. A key finding of this work 
was that two blood-cell lineages, myeloid and 
lymphoid, were populated with cells harbour- 
ing different tags. This result agrees with Busch 
and colleagues’ findings, suggesting that ST- 
HSCs that are beginning to become restricted 
to one lineage are the predominant source of 
haematopoiesis under unperturbed condi- 
tions. Also in agreement with Busch et al., the 
paper reported that the peripheral blood was 
highly polyclonal. 

Busch and colleagues’ work, along with 
related papers*” examining unperturbed 
haematopoiesis in vivo, opens up several 
avenues for future research. A study® involv- 
ing transplantation of blood stem cells revealed 
a high degree of variability between the clonal 
populations in the post-transplantation blood, 
in both their size and their ability to give rise 
to different lineages. This variability can be 
attributed to various mechanisms, both intrin- 
sic (different developmental potencies for the 
stem cells) and extrinsic (the surrounding 
microenvironment). Because of our limited 
understanding of clone sizes and lineage-con- 
tribution bias in unperturbed conditions, it is 
unclear how this post-transplantation variabi- 
lity relates to Busch and colleagues’ observa- 
tions under normal conditions. It could be 
that large numbers of contributing ST-HSCs 
are needed to ensure that robust polyclonality 
is maintained during normal haematopoiesis 
— something that is compromised when only 
limiting numbers of stem cells contribute to 
post-transplantation haematopoiesis. Alterna- 
tively, the reported post-transplantation clonal 
variability could be a stem-cell-intrinsic phe- 
nomenon specific to the post-transplantation 
environment. The ability to distinguish 
between LT- and ST-HSCs in vivo will be 
required to address this. 

Another key area for investigation is
why ST-HSCs can contribute to long-term
haematopoiesis in unperturbed conditions,
but contribute only transiently to post-transplantation
haematopoiesis. Whether this
reflects an effect of the post-transplant environment
on these cells, or whether it is related
to the increased proliferation in ST-HSCs
during repopulation, is not clear. Finally, as
Busch et al. mention, if these results extend to
humans, efforts to capture the potential of ST-HSCs
for clinical transplants could be valuable.

These areas of uncertainty cannot be
addressed without the development of
experimental techniques to mark prospective
ST- and LT-HSCs in vivo, to quantitatively
analyse the resulting lineages and model the
data statistically’. Only then will we be able
to fully interpret this complex and dynamic
process.

50 Years Ago

On August 20, 1964, one of us...
while trapping for small mammals
near Listowel, County Kerry, caught
an unusual ‘mouse’. On subsequent
examination it proved to be a
member of the family Cricetidae,
the bank vole, Clethrionomys
glareolus Schreber – a family
of mammals hitherto unknown
from Ireland ... Comparisons
of cranial characters were made
with series of British mainland
and continental specimens ... The
only detectable difference is that
the nasals are on average shorter
and the condyle width greater than
the British forms, but even here
there is considerable overlap ...
Investigations now being carried
out are aimed at establishing the
present distribution of this species
in Ireland.

From Nature 27 February 1965

100 Years Ago

The Medical Committee of the
British Science Guild has done
a good work by its resolution
condemning a notorious anti-vivisection
advertisement. The
object of the advertisement was to
prevent our soldiers from being
protected against typhoid fever. If it
be asked why any one of the many
anti-vivisection societies should
behave in this way, we can only say
with Dr. Watts that “Satan finds
some mischief still for idle hands to
do.” ... Few of us are wanting to hear
Pasteur called a charlatan; few of us
are wanting anti-vivisection lectures
and shops. Everybody is sure, who
is capable of clear thinking, that our
men of science are neither cruel
nor stupid. But anti-vivisection
cannot rest. It must find something
to attack, something to abuse ...
We hope that it will be many years
before anti-vivisection emerges out
of the public disgrace which it has
brought upon itself.

From Nature 25 February 1915
A giant in the
young Universe 


Astronomers have discovered an extremely massive black hole from a time when 
the Universe was less than 900 million years old. The result provides insight into 
the growth of black holes and galaxies in the young Universe. SEE LETTER P.512 


BRAM VENEMANS 


It is commonly believed that every massive
galaxy in the Universe harbours a supermassive
black hole at its centre. These black
holes are thought to have formed in the young
Universe with initial masses of between 100 and
100,000 times the mass of the Sun¹. Over time,
some of them have grown to be up to billions
of solar masses by pulling (accreting) interstellar
material from their surroundings and/or
through merging with other black holes. The
most massive black holes that have been found
in the nearby Universe have masses of more
than 10 billion solar masses²,³. For comparison,
our own Galaxy harbours a black hole with a
mass of between 4 million and 5 million solar
masses⁴. On page 512 of this issue, Wu et al.
report the discovery of a supermassive black
hole with a mass of a remarkable 12 billion solar
masses, from a time when the Universe was
only 875 million years old; that is, about 6%
of its current age of 13.8 billion years.

Wu and colleagues identified this monster
in optical and near-infrared imaging data
because it was accreting gas at a high rate. This
gas is pulled towards the black hole by gravity
and can efficiently radiate away part of its
potential energy. Accreting supermassive black
holes can therefore be very bright, and can be
seen across the Universe as luminous sources
termed quasars. Because the light coming from
avery distant quasar takes billions of years to 
reach Earth, astronomers can observe such 
accreting black holes as they were when the 
Universe was young. 

Theoretically, it is not implausible to find a 
black hole of more than 10 billion solar masses 
within 1 billion years after the Big Bang. But 
it is still surprising to uncover such a massive 
black hole in the early Universe. It must have 
been accreting gas at close to the maximum 
rate for most of its existence; the maximum rate 
is set by the pressure of the radiation emit- 
ted by the in-falling material. The prolonged 
period of almost maximum accretion is 
puzzling, because the strong radiation emitted 
by a quasar is generally assumed to be capable 
of halting accretion, limiting its existence to 
10 million to 100 million years. The fact that 
the supermassive black hole has grown to 
12 billion solar masses in less than a billion 
years implies that the radiation did not inhibit 
the high accretion. 
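
The timing argument in this paragraph can be made concrete with a rough calculation. The sketch below assumes the standard Eddington-limit expression for the mass e-folding time and a radiative efficiency of 0.1; both are textbook assumptions rather than values taken from this article, and the function names are illustrative only.

```python
import math

# Physical constants in SI units
G = 6.674e-11            # gravitational constant
C = 2.998e8              # speed of light
M_PROTON = 1.6726e-27    # proton mass
SIGMA_T = 6.6524e-29     # Thomson scattering cross-section
SECONDS_PER_MYR = 3.156e13


def e_folding_time_myr(efficiency=0.1, eddington_ratio=1.0):
    """Mass e-folding time (Myr) for accretion at a fixed fraction of the Eddington rate.

    efficiency is the assumed fraction of accreted rest-mass energy radiated away.
    """
    t_edd = SIGMA_T * C / (4.0 * math.pi * G * M_PROTON)  # seconds
    t_fold = t_edd * efficiency / ((1.0 - efficiency) * eddington_ratio)
    return t_fold / SECONDS_PER_MYR


def growth_time_myr(seed_mass, final_mass, efficiency=0.1, eddington_ratio=1.0):
    """Time (Myr) to grow exponentially from seed_mass to final_mass (same mass units)."""
    return e_folding_time_myr(efficiency, eddington_ratio) * math.log(final_mass / seed_mass)


if __name__ == "__main__":
    # Seed masses span the range quoted in the text: 100 to 100,000 solar masses.
    for seed in (1e2, 1e5):
        t = growth_time_myr(seed, 12e9)
        print(f"seed of {seed:.0e} solar masses: about {t:.0f} Myr of uninterrupted accretion")
```

Under these assumptions the required growth time is comparable to the 875 million years available, which is why accretion at close to the maximum rate for most of the black hole's existence is needed.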

In general, studies of supermassive black 
holes at the centres of nearby galaxies have 
revealed a tight correlation between the mass 
of the black hole and the total mass in stars of 
the galaxy hosting it®. Typically, the mass of a 
black hole is higher when it resides in a more 
massive galaxy, with the ratio of the black-hole 
mass to galaxy mass”” being about 0.14-0.5%. 


Therefore, it has been suggested that the growth 
of both the black hole and the host galaxy are 
causally connected. If the relation between 
black-hole mass and host-galaxy mass were 
to hold true even in the distant Universe, 
we would expect the galaxy harbouring the 
12-billion-solar-mass black hole to contain a 
whopping 4 trillion to 9 trillion solar masses 
in stars, which is the same as the most massive 
galaxies seen in the current Universe. Studying 
this host galaxy will give us a glimpse of how 
massive galaxies formed in the early Universe, 
and of the interplay between the formation of 
stars in the galaxy and the accretion onto its 
central black hole. 

Intriguingly, the black hole discovered by Wu 
and collaborators is not only the most massive 
of its kind known in the early Universe, it is also, 
owing to the high accretion rate, by far the most 
luminous object detected at that cosmic epoch. 
The quasar can therefore be used as a means 
of learning about the distant cosmos. As the 
quasar’s light travels towards observers on 
Earth, it passes through the gas of the intergalactic
medium. This medium contains
hydrogen, helium and various metals (elements 
heavier than helium that are produced inside 
stars), which leave an imprint on the spectrum 
of the quasar by absorbing a small amount of 
the quasar’s light at specific wavelengths. The 
brighter the quasar, the more comprehensive 
the investigation of the intervening gas can 
be. Thus, the extreme brightness of the newly 
discovered quasar will allow the abundance of 
metals in the intergalactic medium of the early 
Universe to be measured in unprecedented 
detail. Such measurements will provide infor- 
mation about the star-formation processes at 
work shortly after the Big Bang, which pro- 
duced these metals. 

Finally, quasars as bright as the one reported 
here could easily be seen at larger distances 
from Earth than that of this quasar, and 
hence in an even younger Universe. Although 
accreting supermassive black holes become 
increasingly rare at earlier cosmic times’, 
current and future wide-field near-infrared 
imaging surveys should be able to uncover 
such objects. These giants of the Universe will 
provide the ideal targets from which to learn 
about the Universe during the first few hundred
million years after the Big Bang.
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Teleportation for two 


The ‘no-cloning’ theorem of quantum mechanics forbids the perfect copying 
of properties of photons or electrons. But quantum teleportation allows their 
flawless transfer — now even for two properties simultaneously. SEE LETTER P.516 


WOLFGANG TITTEL 


Suppose you see a beautiful table in a
museum and you would like to have
the same one at home. What could you
do? One strategy is to accurately measure
all its properties, namely its form (length, height
and width) and its appearance (material and
colour), and then reproduce an identical
copy for your living room. But this
‘measure-and-reproduce’ strategy would fail if the table
were a quantum particle, such as a photon or
an electron orbiting an atomic nucleus. The
no-cloning theorem¹ of quantum mechanics
tells us that it is impossible to copy such a
particle perfectly. On page 516 of this issue, Wang
et al.² show how to get around this apparent
limitation of quantum physics. In a beautiful
extension of previous experiments, they demonstrate
how to transfer the values of two properties
of a photon, the spin angular momentum
(the direction of the photon’s electric field,
generally referred to as polarization) and the
orbital angular momentum (which depends
on the field distribution), through quantum
teleportation onto another photon.

Quantum teleportation was proposed³ in
1993 and first demonstrated⁴ in 1997 for a
single property of a photon (the polarization).
It allows the flawless transfer of the unknown 
properties of an object onto a second object 
without contradicting the no-cloning theo- 
rem: the first object loses all its properties at 
the same time, that is, the properties are not 
‘copied’ during quantum teleportation, they 
are transferred. However, the properties of 
the second object after this transfer remain 
unknown — all that is known is that they have 
been made identical to those of the first object 
before teleportation. What is more, the transfer
does not happen instantaneously, contrary to a
common misconception in the non-scientific literature.
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Figure 1 | Teleportation of photon polarization and orbital angular momentum. Photon A, whose 
polarization and orbital angular momentum are shown with a small arrow and an ellipse, respectively, 
is measured jointly with photon B, which is quantum-mechanically entangled with photon C. 

This act consists of: a comparative measurement of the polarizations of photons A and B (CM-P);
a non-destructive verification that exactly one photon exits this measurement in path 1, and hence exactly
one photon exits in path 2, given that two photons entered CM-P; and a comparative measurement of the 
orbital angular momenta of photons A and B (CM-OAM). The measurements result in the teleportation 
(that is, the transfer) of photon A’s properties onto photon C. The transfer may require rotations of 
photon C’s (unknown) polarization and orbital angular momentum, as determined by the outcomes of 
the comparative measurements. Wang et al.” have implemented all but the rotation steps in this transfer 
scheme. Teleporting the polarization alone does not require the non-destructive measurement, the 
CM-OAM, nor the rotation of photon C’s orbital angular momentum. 
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In addition to the object (A) that carries the 
property to be teleported, quantum telepor- 
tation requires two more objects (B and C; 
Fig. 1). Objects B and C have to be entangled, 
which means that their properties are strongly 
correlated. For instance, the two photons B and 
C should have the same polarization, but the 
actual direction of their individual electric 
fields is not defined. Sounds weird? Just think, 
for example, that they are either both horizon- 
tally polarized or both vertically polarized, or 
both polarized at 45°. Photon A, whose polari- 
zation, say, will be teleported onto photon C, 
is measured jointly with photon B in a way 
that reveals, loosely speaking, the difference 
in the electric fields’ directions without reveal- 
ing the individual directions. What would we 
learn from getting, for instance, zero as the 
result? From the outcome of this comparative 
measurement, we know that the polarization 
of photon A equals that of photon B. Further- 
more, from the entanglement of photons B and 
C, we know that the polarization of photon B 
equals that of photon C. Hence, we find that 
the electric field of photon C must now point 
in the same direction as that of photon A before 
the measurement. 

Note that the outcome of the joint measure- 
ment could also have been different: for 
example, A and B are orthogonally polarized. 
Similar reasoning to that used before would 
lead to the conclusion that photon C’s electric 
field is rotated by 90° with respect to that of 
photon A. Therefore, rotating it back would 
allow one to perfectly recover the original 
polarization encoded in photon A. In short, 
the joint measurement, possibly followed by a 
well-defined rotation of the (unknown) polari- 
zation of photon C, has allowed the teleport- 
ing (transferring) of the polarization property 
from photon A to photon C without error. 
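
The single-property protocol described above can be sketched numerically. The following is a minimal textbook model, assuming that photon A's polarization is a two-component state (alpha for horizontal, beta for vertical), that photons B and C share the entangled state "both horizontal or both vertical", and that the comparative measurement is a projection onto the four Bell states. All function and variable names are illustrative and are not taken from Wang and colleagues' work.

```python
import math

SQRT2 = math.sqrt(2.0)

# Four possible outcomes of the joint measurement on photons A and B,
# written as {(a, b): amplitude} with 0 = horizontal and 1 = vertical.
# "phi" outcomes: A and B have the same polarization.
# "psi" outcomes: A and B are orthogonally polarized.
BELL_STATES = {
    "phi+": {(0, 0): 1 / SQRT2, (1, 1): 1 / SQRT2},
    "phi-": {(0, 0): 1 / SQRT2, (1, 1): -1 / SQRT2},
    "psi+": {(0, 1): 1 / SQRT2, (1, 0): 1 / SQRT2},
    "psi-": {(0, 1): 1 / SQRT2, (1, 0): -1 / SQRT2},
}


def swap_components(state):
    """Exchange horizontal and vertical amplitudes (a 90-degree-type rotation)."""
    return (state[1], state[0])


def flip_sign(state):
    """Flip the sign of the vertical amplitude."""
    return (state[0], -state[1])


# Rotation applied to photon C for each measurement outcome.
CORRECTIONS = {
    "phi+": lambda s: s,
    "phi-": flip_sign,
    "psi+": swap_components,
    "psi-": lambda s: flip_sign(swap_components(s)),
}


def teleport(alpha, beta):
    """Return {outcome: (probability, corrected state of photon C)}.

    Assumes |alpha|^2 + |beta|^2 = 1.
    """
    psi_a = (complex(alpha), complex(beta))
    outcomes = {}
    for name, bell in BELL_STATES.items():
        c_state = [0j, 0j]
        for (a, b), amp in bell.items():
            # B and C are entangled as (|00> + |11>)/sqrt(2), so finding B in
            # state b leaves C in state b. The Bell amplitudes are real, so no
            # complex conjugation is needed in the projection.
            c_state[b] += amp * psi_a[a] / SQRT2
        probability = sum(abs(x) ** 2 for x in c_state)
        norm = math.sqrt(probability)
        normalized = tuple(x / norm for x in c_state)
        outcomes[name] = (probability, CORRECTIONS[name](normalized))
    return outcomes
```

In this idealized model every outcome occurs with equal probability and, after the outcome-dependent rotation, photon C carries exactly the state that photon A started with, without alpha or beta ever being measured individually.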

To demonstrate the teleportation of two 
properties, Wang and colleagues started with 
a single photon (photon A in Fig. 1) prepared 
in a combination of polarization and orbital 
angular momentum. Using high-intensity 
laser pulses that pass through a crystal, they 
also created a photon pair (photons B and C in
Fig. 1) in a ‘hyper-entangled’ state, in which the 
photons are simultaneously entangled in the 
two properties to be teleported. Making two 
joint measurements (one per property) that 
compared the polarizations and the orbital 
angular momenta of photon A and photon 
B then led to the teleportation of photon A’s 
properties onto photon C. 

The biggest challenge for the researchers 
was the concatenation of the two joint meas- 
urements. It required, as an intermediate step, 
the verification that exactly one photon exited 
the first measurement (that of polarization) in 
each of the two possible paths leading to the 
second measurement (that of orbital angular
momenta), without destroying the photons.
The non-destructive detection of, say, a photon 
in path 1 can be implemented by teleporting 


its orbital angular momentum onto another 
photon, which then enters the second joint 
measurement. This is because teleportation 
not only transfers a property from one pho- 
ton to another, but also indicates that a photon 
existed. And, given that two photons entered 
(and hence left) the first comparative measurement,
the non-destructive detection of a
photon in path 1 also indicates that one photon
was present in path 2, which is exactly the
requirement for the verification step. This step
needed another pair of photons (not shown in
Fig. 1) entangled in their orbital angular
momenta.

An interesting question is whether the demonstrated
method for the teleportation of two properties
can be generalized to more properties.
The authors affirm that this is possible in prin- 
ciple. However, the probability of the required 
joint measurements leading to a useful out- 
come becomes smaller and smaller as the 
number of properties (and thus of joint meas- 
urements) increases. Although the probability 
is half in the case of standard (single-property) 
teleportation, it is 1/32 for two properties, as 
shown for the first time by Wang and co-work- 
ers. Furthermore, it decreases to 1/4,096 when 
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teleporting an object that is described by three 
properties. Adding photons and photon detec- 
tors may increase the efficiency’, but this adds 
even more complexity to an already difficult 
measurement. 

Even without these additional photons, 
the joint measurement becomes increasingly 
challenging as the number of properties 
increases: in the teleportation of two proper- 
ties, a ‘one-property teleporter’ is used, and in 
the teleportation of three properties, a ‘two- 
property teleporter’ and a ‘one-property tele- 
porter’ would be needed. You can guess what is 
required for the teleportation of N properties. 
Yet, Wang and colleagues’ demonstration is an 
important step in understanding, and show- 
casing, one of the most profound and puzzling 
predictions of quantum physics. It may serve as 
a powerful building block for future quantum 
networks, which generally require teleportation 
units for the transmission of quantum data.
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RNA modification does 
a regulatory two-step 


The m⁶A structural modification of RNA regulates gene expression. It has now been
found to mediate an unusual control mechanism: by altering the structure of RNA,
m⁶A allows a regulatory protein to bind to that RNA. SEE LETTER P.560


DOMINIK THELER & FREDERIC H.-T. ALLAIN 


One of the most abundant modifications
of messenger RNA is thought to be
N⁶-methyladenosine (m⁶A), in which
a methyl group is attached to the N6 position
of adenine, an RNA base. The m⁶A modification
has a role in regulating gene expression,
and perturbations of this regulatory machinery
are associated with human disease. But
little is known about the mechanism by which
the single methyl group of m⁶A exerts its effect.
On page 560 of this issue, Liu et al.¹ report that
m⁶A alters the secondary structure of RNA,
allowing an RNA-binding protein to access
the RNA sequence opposite the modification
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and therefore to regulate expression. 

Most experimental evidence for the 
mechanism and role of RNA modifications 
has been gathered in non-protein-coding 
RNAs. The m⁶A modification of mRNA was
first described in 1974 (refs 2, 3), and subsequent
studies quickly identified the methylase
protein complex as the machinery that
‘writes’ m⁶A into mRNA (for reviews of m⁶A,
see refs 4–6). Impairment of this complex leads
to developmental arrest in several organisms. 

After those early discoveries, not much was 
learnt about the role of m⁶A until the start of
this decade, when m⁶A demethylase enzymes
were identified as ‘erasers’ of this modifica- 
tion*®. These findings hinted at the dynamic 
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Figure 1 | Regulation of alternative splicing by the m°A RNA modification. Alternative splicing is the 
process by which different messenger RNAs are generated from transcripts of a single gene. Here, protein- 
coding sequences (exons) of an RNA transcript are shown in red and yellow, and non-coding sequences 
(introns) are in blue. a, In this example, splicing removes the introns and the exon trapped between them. 
b, When an adenine base (A) is methylated by a ‘writer’ protein to form the m⁶A modification (A-CH₃),
direct binding of a ‘reader’ protein leads to splicing in which all the exons are included in the mRNA
product. ‘Eraser’ proteins can reverse the formation of m⁶A. c, Liu et al.¹ report that, when m⁶A forms
in a stem loop of RNA, the modification can alter the loop’s structure by changing the base-pairing. The
HNRNPC regulatory protein then binds to a stretch of uridine (U) nucleotides in the loop, also leading to
inclusion of all the exons.


nature of m⁶A. Genetic alterations in the gene
that encodes one of the erasers are associated
with obesity and cancer in humans. In parallel,
sequencing studies increased the number of
known m⁶A sites from just a few to several
thousand, and revealed that the sites are highly
evolutionarily conserved in humans and mice.
‘Reader’ proteins of m⁶A have also been
identified and found to be involved in gene 
expression. All reader proteins that have been 
biochemically validated to bind directly to the 
modification contain a hitherto poorly charac- 
terized RNA-binding domain called YTH. This 
domain binds to the methylated form of a given 
RNA with a several-fold higher affinity than it 
does for the unmethylated form, and the struc- 
tural basis for this enhancement was reported 
last year’. Nevertheless, the mechanisms of 
action of these m⁶A readers are unknown.
The secondary structure of RNA provides 
another regulatory layer of gene expression, 
but is not strictly dependent on the primary 
sequence of bases. For example, RNA stem 
loops occur when two regions of a single- 
stranded RNA (ssRNA) molecule form a base- 
paired duplex (the stem) ending in an unpaired 
loop. Stem loops can act statically by attract- 
ing regulatory proteins to the stem or loop, or 
both. But they can also be dynamic, behaving 


as RNA-based regulatory switches that influ- 
ence cellular function by changing structure 
in response to different factors. These factors 
can range from physico-chemical parameters, 
including pH, temperature and the binding 
of ions and metabolites, to biological macro- 
molecules such as nucleic acids and proteins'*. 
Liu and colleagues report that there are several 
thousand such switches in mRNA, and that the 
switches are triggered by RNA modification. 

Building on their discovery of an m⁶A site
in the stem loop of a long non-coding RNA
dubbed MALAT1, the authors found that the
HNRNPC protein preferentially binds to the
methylated form of this RNA; HNRNPC is an
abundant ssRNA-binding protein involved in
the regulation of several post-transcriptional
gene-regulatory processes, such as alternative
splicing. An HNRNPC-binding site consisting
of a stretch of uridine (U) nucleotides is
located in the stem loop opposite the m⁶A site.
The researchers observed that methylation at
the m⁶A site changes the structure of the stem
loop by destabilizing base pairs and increasing
the length of the single-stranded U-rich loop, 
thereby making the binding site more acces- 
sible to HNRNPC (Fig. 1). 

Unusually, the m⁶A modification in
MALAT1 does not recruit effector proteins
(readers) directly, but does so indirectly by
altering the RNA’s structure: methylation acts
as a switch that alters the loop structure, which
enables protein binding. This two-step process
allows tight regulation of the events controlled
by the m⁶A-switch. A study published last
month combined biophysics, structure determination
and probing of the structural context
of m⁶A in vivo. It revealed the propensity of
m⁶A to alter secondary-structural elements
formed from double-stranded RNA when 
located at the border between single-stranded 
and base-paired regions, as is observed in the 
case of MALAT1. 

Liu et al. went on to characterize the
abundance and functional roles of m⁶A-switches,
identifying 2,798 of them with high
confidence by observing which m⁶A sites display
decreased HNRNPC binding when overall
m⁶A levels in the transcriptome (the full
set of transcribed RNAs) are reduced. They
further characterized the interlinked actions
of HNRNPC and m⁶A methylation by showing
that the abundance of 5,251 different cellular
RNA transcripts alters either when HNRNPC
binding is inhibited or when m⁶A methylation
is decreased. The authors also observed similar
co-regulation by methylation and HNRNPC 
binding for many alternatively spliced exons 
(protein-coding regions of genes). 

The findings are a huge advance in our
understanding of how m⁶A controls gene
expression, but several questions remain. Is
HNRNPC the only RNA-binding protein to
be recruited through such a mechanism upon
m⁶A methylation, or are there others? Do the
reader proteins involved in these m⁶A-switches
further destabilize the secondary structure, or
do they help to recruit the factors? Which cellular
events trigger the switches? And, finally,
do other RNA modifications in the vast
transcriptome create RNA switches? We think that
there are probably many more.
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Whole genomes redefine the mutational 
landscape of pancreatic cancer 
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Pancreatic cancer remains one of the most lethal of malignancies and a major health burden. We performed whole-genome 
sequencing and copy number variation (CNV) analysis of 100 pancreatic ductal adenocarcinomas (PDACs). Chromosomal 
rearrangements leading to gene disruption were prevalent, affecting genes known to be important in pancreatic cancer 
(TP53, SMAD4, CDKN2A, ARID1A and ROBO2) and new candidate drivers of pancreatic carcinogenesis (KDM6A and
PREX2). Patterns of structural variation (variation in chromosomal structure) classified PDACs into 4 subtypes with poten- 
tial clinical utility: the subtypes were termed stable, locally rearranged, scattered and unstable. A significant proportion 
harboured focal amplifications, many of which contained druggable oncogenes (ERBB2, MET, FGFR1, CDK6, PIK3R3 and 
PIK3CA), but at low individual patient prevalence. Genomic instability co-segregated with inactivation of DNA main- 
tenance genes (BRCA1, BRCA2 or PALB2) and a mutational signature of DNA damage repair deficiency. Of 8 patients who 
received platinum therapy, 4 of 5 individuals with these measures of defective DNA maintenance responded. 


Pancreatic cancer (PC) has a median survival of 6 months and a 5-year
survival that remains less than 5% despite 50 years of research and
therapeutic development. It is the fourth commonest cause of cancer
death in Western societies and is projected to be the second leading
cause within a decade. As a consequence, there is an urgent need to
better select patients for current therapies and develop novel
therapeutic strategies.

Recent exome and CNV analyses of pancreatic ductal adenocarcinoma
have revealed a complex mutational landscape. Activating mutations
of KRAS are near ubiquitous and inactivation of TP53, SMAD4
and CDKN2A occur at rates of >50%. The prevalence of recurrently
mutated genes then drops to ~10% for a handful of genes involved in
chromatin modification, DNA damage repair and other mechanisms
known to be important in carcinogenesis; however, a long tail of
infrequently mutated genes dominates, resulting in significant intertumoural
heterogeneity. Faced with this diversity, it is not surprising that
therapeutic development using an unselected approach to patient
recruitment for clinical trials has been challenging.

Somatic structural rearrangement of chromosomes represents a com- 
mon class of mutation that is capable of causing gene disruption (such as 
deletion or rearrangement), gene activation (for example, copy number 
gain or amplification) and the formation of novel oncogenic gene pro- 
ducts (gene fusions). Many of these events actively drive carcinogenesis”® 
and in some instances present therapeutic targets. Early karyotyping and
more recent genomic sequencing of small numbers of primary tumours
(n = 3) and metastases (n = 10) suggest that PDAC genomes contain
widespread and complex patterns of chromosomal rearrangement.
Here we performed deep whole-genome sequencing of 100 PDACs 
and show that structural variation (variation in chromosomal structure) 
is an important mechanism of DNA damage in pancreatic carcinogen- 
esis. We classify PDAC into four subtypes based on structural variation 
profiles and implicate molecular mechanisms underlying some of these 
events. Finally, as proof of concept, we use a combination of structural 
variation, mutational signatures and gene mutations to define putative 
biomarkers of therapeutic responsiveness for platinum-based chemotherapy,
which is a current therapeutic option for PDAC, and for
therapeutics that target similar molecular mechanisms, such as PARP
inhibitors that are currently being tested in clinical trials.
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Figure 1 | Mutations in key genes and pathways in pancreatic cancer. The 
upper panel shows non-silent single nucleotide variants and small insertions 
or deletions. The central matrix shows: non-silent mutations (blue), copy 
number changes (amplification (>5 copies) represented in red and loss 
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represented in green) and genes affected by structural variants (SV, yellow). 
Pathogenic germline variants are highlighted with asterisk (*) symbols. The 
histogram on the left shows the number of each alteration in each gene. 
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Genomic landscape of pancreatic cancer 


Patients were recruited and consent obtained for genomic sequencing
through participating institutions of the Australian Pancreatic Cancer
Genome Initiative (APGI; http://www.pancreaticcancer.net.au) as part
of the International Cancer Genome Consortium (ICGC; http://www.icgc.org)
(Supplementary Table 1). Array-based CNV was analysed using
GAP and tumour cellularity estimated with qPure. Whole-genome
sequencing was performed on 100 primary PDACs with an epithelial
cellularity of ≥ 40% (n = 75), and complemented by cell lines derived
from APGI participants (n = 25), to an average depth of 65×, and
compared to the germline (average depth 38×) (Supplementary Table 2).
Mutations were detected using qSNP and GATK and indels called with
Pindel and GATK.


Point mutations and structural variation in PDAC 

A total of 857,971 somatic point mutations and small insertions and 
deletions were detected in the cohort: 7,888 were non-silent mutations 
in 5,424 genes (Supplementary Tables 3 and 4). Orthogonal validation 
of >3,000 exonic mutations estimated the accuracy of mutation calls 
at >95% (Methods). Consistent with previous estimates”, the average 
mutational burden across the cohort was 2.64 per Mb (range 0.65- 
28.2 per Mb). Somatic structural variants were identified with the qSV 
package, which uses multiple lines of evidence to define events (discor- 
dant pairs, soft clipping and split reads). Events verified using an orthog- 
onal sequencing method were also included (Methods and Extended 
Data Fig. 1a). Where possible, these events were cross-referenced with 
CNV data (Methods). In total, 11,868 somatic structural variants were 
detected at an average of 119 per individual (range 15-558) (Supplemen- 
tary Table 5 and Extended Data Fig. 1b). The majority of structural 
variants were intra-chromosomal (10,114) and were classified into 7 
types: intra-chromosomal rearrangements (5,860), deletions (1,393), 
duplications (128), tandem duplications (179), inversions (1,629), fold- 
back inversions (579) and amplified inversions (346); inter-chromosomal 
translocations were less prevalent (1,754) (Supplementary Table 6). A 
total of 6,908 rearrangements directly disrupted gene sequences and 
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1,220 genes contained a breakpoint in 2 or more patients (Supplemen- 
tary Table 7). Recurrent gene fusions were not detected: 1,236 structural 
variants led to the joining of two gene loci, however, only 183 of these 
events were fused in an orientation and frame that was capable of express- 
ing a product, and none of these predicted fusion events occurred in 
more than one sample. 
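
The structural-variant counts quoted in this paragraph can be cross-checked with simple arithmetic. The sketch below only re-adds the figures given in the text; the dictionary and function names are illustrative.

```python
# Structural-variant counts as quoted in the text for the 100-tumour cohort.
INTRA_CHROMOSOMAL = {
    "intra-chromosomal rearrangements": 5860,
    "deletions": 1393,
    "duplications": 128,
    "tandem duplications": 179,
    "inversions": 1629,
    "foldback inversions": 579,
    "amplified inversions": 346,
}
INTER_CHROMOSOMAL_TRANSLOCATIONS = 1754
COHORT_SIZE = 100


def summarize():
    """Recompute the totals reported in the text from the per-class counts."""
    intra = sum(INTRA_CHROMOSOMAL.values())
    total = intra + INTER_CHROMOSOMAL_TRANSLOCATIONS
    return {
        "intra": intra,
        "total": total,
        "mean_per_tumour": total / COHORT_SIZE,
        "intra_fraction": intra / total,
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(key, value)
```

The seven intra-chromosomal classes add up to the reported 10,114 events, and together with the translocations they give the reported total of 11,868, or about 119 per tumour.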


Genes affected by mutation and structural variation 


Commonly mutated genes that characterize PDAC (KRAS, TP53, SMAD4 
and CDKN2A)”” were reaffirmed as significant using MutSig”' analysis 
(Supplementary Table 8). Combining structural variation events with 
deleterious point mutations increased the prevalence of inactivation 
events for TP53 to 74% (3 structural variants and 71 mutations), 31% for 
SMAD4 (9 structural variants and 22 mutations) and 35% for CDKN2A 
(11 structural variants and 24 mutations). Two additional genes not 
previously described in human PDAC (KDM6A and PREX2) had recur- 
rent pathogenic mutations and structural variants at a rate of 10% or 
more. KDM6A is a SWI/SNF interacting partner that was identified 
in a pancreatic sleeping-beauty transposon mutagenesis screen”, and 
is mutated in RCC and medulloblastoma. In our cohort, KDM6A was 
inactivated in 18% of patients (4 frame shifts, 1 in-frame deletion and 2
missense mutations, 5 structural variants and 8 homozygous deletions).
In most cases (n = 15), both alleles of KDM6A were affected. The RAC1
guanine nucleotide exchange factor PREX2, mutated in melanoma” 
was inactivated in 10% of PDAC patients (1 frame shift, 1 splice site and 
5 missense mutations, 2 structural variants and 1 homozygous deletion). 
In addition, the tumour suppressor gene RNF43, originally identified 
in cystic tumours of the pancreas, was inactivated in 10% of PDAC 
patients (4 frameshift and 4 nonsense mutations, 2 structural variants). 
Two of these PDACs had an associated intraductal papillary mucinous
neoplasm (IPMN). Recent studies have suggested that loss of functional
RNF43 may confer sensitivity to WNT inhibitors. Figure 1 shows the
prevalence of aberrations in key driver genes and pathways in PDAC,
implicating structural variation as an important mutational mechanism
in pancreatic carcinogenesis.
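
As a rough consistency check, the per-gene event counts given in this section can be added up and compared with the reported prevalence. This is a sketch only: it assumes a cohort of 100 tumours, so that percentages equal patient counts, and the data structure is hypothetical. Event totals can exceed patient counts when a tumour carries more than one hit in the same gene.

```python
COHORT_SIZE = 100  # assumption: percentages in the text refer to all 100 tumours

# Event counts and reported prevalence as quoted in the text.
GENES = {
    "TP53": {"reported_percent": 74,
             "events": {"structural variants": 3, "mutations": 71}},
    "SMAD4": {"reported_percent": 31,
              "events": {"structural variants": 9, "mutations": 22}},
    "CDKN2A": {"reported_percent": 35,
               "events": {"structural variants": 11, "mutations": 24}},
    "KDM6A": {"reported_percent": 18,
              "events": {"frame shifts": 4, "in-frame deletions": 1, "missense": 2,
                         "structural variants": 5, "homozygous deletions": 8}},
    "PREX2": {"reported_percent": 10,
              "events": {"frame shifts": 1, "splice site": 1, "missense": 5,
                         "structural variants": 2, "homozygous deletions": 1}},
    "RNF43": {"reported_percent": 10,
              "events": {"frameshift": 4, "nonsense": 4, "structural variants": 2}},
}


def event_totals(genes):
    """Total number of inactivating events listed for each gene."""
    return {gene: sum(info["events"].values()) for gene, info in genes.items()}


def genes_with_extra_events(genes, cohort_size=COHORT_SIZE):
    """Genes for which more events are listed than patients reported as affected."""
    totals = event_totals(genes)
    return [gene for gene, info in genes.items()
            if totals[gene] > info["reported_percent"] * cohort_size / 100]
```

With these figures only KDM6A lists more events (20) than affected patients (18), which is in line with the statement that both alleles were affected in most cases.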
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Figure 2 | Subtypes of pancreatic cancer. a, Subgroups of PDAC based on the 
frequency and distribution of structural rearrangements. Representative 
tumours of each group are shown. The coloured outer rings are chromosomes, 
the next ring depicts copy number (red represents gain and green represents 
loss), the next is the B allele frequency (proportion of the B allele to the
total quantity of both alleles). The inner lines depict chromosome structural
rearrangements. b, The contribution of the BRCA mutational signature within
each tumour ranked by prevalence (red bars). Unstable tumours are associated 
with a high BRCA mutation signature and deleterious mutations in BRCA 
pathway genes. The dagger (†) symbol indicates predicted only as possibly
damaging by Polyphen2. 
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Subtyping using structural rearrangements 


The distribution of events was used to classify tumours into the follow- 
ing four subtypes (Fig. 2 and Extended Data Fig. 2 and Methods). 


Stable subtype 

Subtype 1 was classified as ‘stable’ (20% of all samples). These tumour 
genomes contained ≤ 50 structural variation events and often exhibited
widespread aneuploidy suggesting defects in cell cycle/mitosis (Extended 
Data Fig. 3). Point mutation rates for KRAS and SMAD4 were similar 
to the rest of the cohort, and the prevalence of TP53 mutations was only 
slightly less (61% versus a mean of 70% across all samples). In addition, 
telomere length was no different in comparison to other subgroups. 


Locally rearranged subtype 

Subtype 2 was classified as ‘locally rearranged’ (30% of all samples). 
This subtype exhibited a significant focal event on one or two chromo- 
somes. The group could be further divided into those with focal regions 
of gain/amplification and those that contained complex genomic rear- 
rangements (Extended Data Fig. 4). Approximately one-third of locally 
rearranged genomes contained regions of copy number gain that har- 
boured known oncogenes (Supplementary Table 9). These included 
common focal amplifications in KRAS, SOX9 and GATA6 and often 
included therapeutic targets such as ERBB2, MET, CDK6, PIK3CA and 
PIK3R3, but at low individual prevalence (1-2% of patients) (Supplementary
Table 9). The remaining local rearrangements involved complex
genomic events such as breakage-fusion-bridge (BFB, n = 9) or
chromothripsis (n = 15), which resulted in a ring chromosome in at least
one case (ICGC_0059) (Extended Data Figs 5 and 6). Chromothripsis
is linked to TP53 mutations in medulloblastoma and acute myeloid
leukaemia and, here, 10/13 chromothriptic tumours had a TP53 mutation,
5 of which were bi-allelic (Fig. 1). Five of these chromothriptic events
occurred after chromosomal duplication, suggesting that they are less
likely to be driving carcinogenesis (Methods).


Scattered subtype 

Subtype 3 was classified as ‘scattered’ (36% of all samples). Tumours 
in this class exhibited a moderate range of non-random chromosomal 
damage and less than 200 structural variation events (Extended Data 
Fig. 7). 


Unstable subtype 

Subtype 4 was classified as ‘unstable’ (14% of all samples). The tumours
exhibited a large number of structural variation events (>200; maximum
of 558) (Extended Data Fig. 8). This scale of genomic instability
suggested defects in DNA maintenance, which potentially defines
sensitivity to DNA-damaging agents (Fig. 3a; Methods).


Genomic markers of defective DNA maintenance 


We mapped the relationship between the unstable subtype, mutations 
in BRCA pathway genes and a recently described mutational signature 
associated with deleterious mutations in BRCA1 or BRCA2 in breast, 
ovarian and pancreatic cancer. The majority of unstable tumours (10
of 14) fell within the top quintile of the BRCA signature when ranked by 
prevalence per Mb (Fig. 2b). In addition, the top quintile of the BRCA 
signature was associated with deleterious mutations of BRCA1 (n = 2), 
BRCA2 (n = 7), and PALB2 (n = 2) (Fig. 2b) (Supplementary Table 10).
Four of the BRCA2 mutations were germline in origin (3 frameshift and 
1 nonsense), and in each case, the wild-type allele was inactivated in the 
tumour. A further 2 patients had somatic mutations in BRCA1 (both 
with splice site mutations), and another 3 had somatic BRCA2 muta- 
tions (1 indel and 2 splice site mutations). All deleterious BRCA1 and 
BRCA2 mutations had inactivation of the second allele. Three patients 
had pathogenic germline PALB2 mutations that were associated with 
the BRCA mutational signature. One of these was a TGTT deletion, 
which is known to occur in pancreatic cancer (this tumour also had
a somatic BRCA2 mutation), and the mutations of PALB2 in the
other 2 cases are associated with an inherited predisposition to breast
cancer. Germline PALB2 mutation carriers did not have evidence of
somatic loss of the second allele; however, heterozygous germline
mutation of PALB2 appears sufficient to cause DNA replication and damage
response defects. In contrast, tumours containing a somatic heterozygous
silent mutation of BRCA2, a heterozygous intronic structural variation
and 2 unclassified heterozygous missense mutations in BRCA1
(predicted to be benign or only possibly damaging by Polyphen2) were
not associated with a high-ranking BRCA mutational signature (<1
BRCA signature mutation per Mb) or an unstable genome (Supplementary
Table 10). Overlapping deleterious mutations in BRCA1, BRCA2 and
PALB2 with unstable genomes and the BRCA mutational signature
showed that mutations in these genes were associated with the top
quintile of the BRCA mutational signature, and the majority (9 of 11) also
exhibited unstable genomes (Fig. 3a).

Figure 3 | Putative biomarkers of platinum and PARP inhibitor
responsiveness. a, A Venn diagram showing the overlap of surrogate measures
of defects in DNA maintenance (unstable genomes and BRCA mutational
signature), with mutations in BRCA pathway genes. Of a total of 24 patients
(24%), 10 have both unstable genomes and the BRCA mutational signature.
The majority of patients with mutations in BRCA pathway genes (9) are within
this intersection; however, 2 have the mutational signature but are classified
either as scattered (n = 1) or locally rearranged (n = 1). GL, germline; S, somatic.
b, Individual tumours are ranked based on their BRCA mutational signature
burden, with the diameter of each circle representing the number of structural
variants in each. Those encircled by a solid line have mutations in BRCA
pathway genes. Responders and non-responders to platinum-based therapy are
indicated with solid lines for patients and broken lines for patient-derived
xenografts (PDX).


Defective DNA repair without BRCA pathway mutations 


Mutations in BRCA pathway genes accounted for approximately half 
of patients with a high BRCA mutational signature and/or an unstable 
genome (Fig. 3a). Hypermethylation is known to play a role in silencing
BRCA1, BRCA2 and PALB2 in some breast and ovarian cancers;
however, high-density methylome array profiling of this cohort allowed
us to exclude this as a contributing mechanism. Single instances of
biallelic, inactivating, somatic mutation were observed for two genes
known to induce genomic instability and chemosensitivity when
inactivated: RPA1 (ref. 31) (splice site and loss of heterozygosity (LOH)),
and the DNA polymerase zeta catalytic unit/REV3L (nonsense and
LOH). We also detected mutations in other genes involved in DNA
maintenance such as ATM, FANCM, XRCC4 and XRCC6 in tumours 
with an unstable genome or the BRCA mutational signature; however, 
they are yet to be causally linked to these genomic events or sensitivity 
to DNA-damaging agents. 


Putative genotypes of platinum responsiveness 


As the APGI was a prospective observational cohort study with exten- 
sive clinical follow-up, it was possible to track therapeutic responsiveness 
of participants that received chemotherapy when their disease recurred. 
At the time of analysis, 53 patients had documented recurrences and 25 
received a variety of chemotherapeutic agents (Supplementary Table 11). 
This analysis was complemented through therapeutic testing of patient- 
derived xenografts (PDXs) generated from APGI participants. Overall, 
8 patients received a platinum-based therapy and 7 PDXs were treated 
with gemcitabine and cisplatin (Fig. 3b). Of 5 patients with unstable 
genomes and/or a high BRCA mutational signature burden (designated 
as ‘on-genotype’), 2 had exceptional responses (defined as complete
radiological resolution of disease and normalization of CA19.9 levels) and
2 had robust partial responses based on RECIST 1.1 criteria (Fig. 4a),
while 3 patients who did not have any of these characteristics
(‘off-genotype’) did not respond. These observations were supported by PDX
studies, in which 2 of 3 on-genotype PDXs responded to cisplatin: one
BRCA2 mutant responded, and one carrying bi-allelic inactivation of
RPA1, which notably retained RAD51 foci (Extended Data Fig. 9), also
responded. Another, with a mutational signature but not an unstable
genome, and without a mutation in a BRCA pathway gene, did not
respond. This compares to no responses in the 4 PDXs in the off-genotype
group (Figs 3 and 4b). Combining patient and PDX response data, on- 
genotype tumours were associated with response to platinum-based 
therapy (P = 0.0070, Fisher’s exact test, Fig. 3b) (Supplementary Table 11). 
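
As a check on the arithmetic, the reported P value can be reproduced from the counts given above. The sketch below is illustrative only. It assumes a 2 × 2 table read from the text (8 on-genotype tumours, of which 6 responded, taking the fifth on-genotype patient as a non-responder; 7 off-genotype tumours, none of which responded) and the common two-sided convention of summing every table that is no more probable than the observed one. The function name is hypothetical and is not part of the study's software.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables (with the same
    margins) that are no more probable than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def weight(x):
        # Unnormalised probability of seeing x in the top-left cell;
        # kept as an integer so the comparison below is exact.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    extreme = sum(
        weight(x) for x in range(lowest, highest + 1) if weight(x) <= observed
    )
    return extreme / total


# Counts as read from the text (an assumption of this sketch):
# patients + PDXs, split into responders and non-responders.
on_resp, on_nonresp = 4 + 2, 1 + 1
off_resp, off_nonresp = 0, 3 + 4

p_value = fisher_exact_two_sided(on_resp, on_nonresp, off_resp, off_nonresp)
print(f"P = {p_value:.4f}")  # prints "P = 0.0070"
```

With these margins only the two most extreme tables contribute, giving 35/5005, which rounds to the value quoted in the text.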


Discussion 


This study provides the most comprehensive description, to date, of the 
genomic events that characterize pancreatic cancer and demonstrates 
that structural variation is a prominent mechanism of genomic damage 
in this disease. It reinforces the importance of KRAS, TP53, SMAD4, 
CDKN2A and ARID1A gene mutations, in addition to numerous genes
mutated at low prevalence. Recurrent mutations identified in KDM6A
further highlight the role of chromatin modification, and a broader role
for aberrant WNT signalling is implicated through the relatively
frequent inactivation of suppressor genes such as ROBO1, ROBO2, SLIT2
and RNF43.

Structural variant analysis classifies PDAC into four subtypes with 
potential clinical relevance. A significant proportion of tumours contain 
amplifications and copy-number gains of known oncogenes, but most 
occur at low individual prevalence, suggesting significant diversity of 
mechanisms involved in PDAC progression. Several of these constitute 
known therapeutic targets with available inhibitors (ERBB2, MET, FGFR1). 
Others include: GATA6, which is known to be amplified in PDAC and
correlates with poor survival in other cancer types; PIK3CA, which is
amplified in ovarian and lung squamous cell carcinomas; PIK3R3,
amplified in ovarian cancer; and CDK6, amplified in oesophageal cancer.
These may present opportunities for therapeutic intervention, either 
alone or in combination with other agents. 

Multiple studies of platinum-based therapies in PDAC have shown 
borderline signals, and some meta-analyses show a benefit, suggesting
that individual studies were underpowered and that these signals
could be driven by subgroups of responders. More recently, addition of
oxaliplatin has shown efficacy in second-line therapy, and FOLFIRINOX,
a platinum-containing combination therapy, is emerging as a treatment
option for advanced PDAC. Most patients do not receive this therapy
owing to its toxicity, or receive it in a substantially modified form. There
are, however, significant responses in subgroups that are not well defined, and
improved survival has been reported in patients with germline BRCA1 and BRCA2
mutations who receive platinum-based therapies. Defining biomarkers
of platinum responsiveness would significantly alter current treatment 
approaches to PDAC and improve overall outcomes. Current patient 
recruitment strategies for clinical trials of PARP inhibitors, thought to 
target similar mechanisms, are mostly based on germline deleterious 
mutations of BRCA1 and BRCA2. If we take into account mutations in 
BRCA pathway components, both germline and somatic, as well as puta- 
tive surrogate measures of deficiencies in DNA maintenance, that is, 
unstable genomes and the BRCA mutational signature, germline mutations
in BRCA1 and BRCA2 account for as few as 4 of a potential 24 patients
(17%), and only 4% of all patients. Genomic instability and BRCA
mutational signature status based on whole-genome sequencing also provide
independent evidence of putative deficiencies in DNA damage repair. 
It remains to be seen whether these surrogate measures are predictive 
of therapeutic response in the absence of BRCA or PALB2 mutations. 
However, the presence of mutations in non-BRCA pathway genes that 
are associated with both genomic instability and chemosensitivity in 
2/14 unstable tumours suggests that diagnostic whole-genome sequenc- 
ing to detect surrogate measures of defects in DNA maintenance may 
ultimately be a better method of identifying potential responders to 
platinum and PARP inhibitor therapy. 

The proof of concept data presented here suggest that mutations in 
BRCA pathway component genes and surrogate measures of defects in 
DNA maintenance (genomic instability and the BRCA mutational sig- 
nature) have potential implications for therapeutic selection for pancre- 
atic cancer. These data define a putative biomarker hypothesis that needs 
testing in a clinical trial, as these results are from a small number of 
patients selected based on high tumour cellularity; patients often received 
combination therapies, and the primary tumour was sequenced rather 
than the recurrence. As only selected gene sets can be tested in the clinic 
at this time, surrogate measures of molecular mechanisms identified 
using whole-genome sequencing can be used to inform individual gene 
selection for clinical use. As diagnostic genomic approaches continue 
to evolve and become more affordable, whole-genome sequencing may 
provide new opportunities in the clinic. However, there are significant 
hurdles still to overcome. These include the technical challenge of whole- 
genome sequencing using small diagnostic samples that are preserved in 
fixatives such as formalin, analytical demands and the return of results 
within a clinically relevant timeframe. Major initiatives are emerging 
that aim to address these challenges (such as Genomics England and 
the Scottish Genomes Partnership) to ultimately advance and assess 
these approaches for their potential to improve human health for many 
diseases including cancer. 


Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper. 
METHODS


Human research ethical approvals. Australian Pancreatic Cancer Genome Initi- 
ative: Sydney South West Area Health Service Human Research Ethics Committee, 
western zone (protocol number 2006/54); Sydney Local Health District Human 
Research Ethics Committee (X11-0220); Northern Sydney Central Coast Health Har- 
bour Human Research Ethics Committee (0612-251M); Royal Adelaide Hospital 
Human Research Ethics Committee (091107a); Metro South Human Research 
Ethics Committee (09/QPAH/220); South Metropolitan Area Health Service Human 
Research Ethics Committee (09/324); Southern Adelaide Health Service/Flinders 
University Human Research Ethics Committee (167/10); Sydney West Area Health 
Service Human Research Ethics Committee (Westmead campus) (HREC2002/3/4.19); 
The University of Queensland Medical Research Ethics Committee (2009000745); 
Greenslopes Private Hospital Ethics Committee (09/34); North Shore Private Hos- 
pital Ethics Committee. Johns Hopkins Medical Institutions: Johns Hopkins Medi- 
cine Institutional Review Board (NA00026689). ARC-NET, University of Verona: 
approval number 1885 from the Integrated University Hospital Trust (AQUI) Ethics 
Committee (Comitato Etico Azienda Ospedaliera Universitaria Integrata) approved 
in their meeting of 17 November 2010 and documented by the ethics committee 
52070/CE on 22 November 2010 and formalized by the Health Director of the AOUI 
on the order of the General Manager with protocol 52438 on 23 November 2010. 
Ethikkommission an der Technischen Universitat Dresden (Approval numbers 
EK30412207 and EK357112012). 

Animal experiment approvals. Mouse experiments were carried out in compli- 
ance with Australian laws on animal welfare. Mouse protocols were approved by 
the Garvan Institute/St Vincent’s Hospital Animal Ethics Committee (ARA 09/19, 
11/23 and 12/21 protocols). Female NOD/SCID/interleukin 2 receptor [IL2R] gamma 
(null) (NSG) mice and athymic Balb-c-nude mice were housed with a 12 h light,
12 h dark cycle, receiving food ad libitum.

Sample acquisition. Samples used were prospectively acquired and restricted to 
primary operable, non-pretreated pancreatic ductal adenocarcinoma. After ethical 
approval was granted, individual patients were recruited preoperatively and con- 
sented using an ICGC approved process. Immediately following surgical extirpa- 
tion, a specialist pathologist analysed specimens macroscopically and samples of 
the tumour, normal pancreas and duodenal mucosa were snap frozen in liquid nitro- 
gen (for full protocol see APGI website: http://www.pancreaticcancer.net.au/). The 
remaining resected specimen underwent routine histopathologic processing and 
examination. Once the diagnosis of pancreatic ductal adenocarcinoma was made, 
representative sections were reviewed independently by at least one other patholo- 
gist with specific expertise in pancreatic diseases (authors: A.G., D.M., R-H.H. and 
A.C.), and only those where there was no doubt as to the histopathological diagnosis
were entered into the study. Co-existent intraductal papillary mucinous neoplasms 
in the residual specimen were not excluded provided the bulk of the tumour was 
invasive carcinoma, and the invasive carcinoma samples were used for sequencing. 
All samples were stored at −80 °C. Duodenal mucosa or circulating lymphocytes
were used for generation of germline DNA. A representative sample of duodenal 
mucosa was excised and processed in formalin to confirm non-neoplastic histology 
before processing. All participant information and biospecimens were logged and 
tracked using a purpose-built data and biospecimen information management 
system (Cansto Pancreas). Median survival was estimated using the Kaplan-Meier 
method and the difference was tested using the log-rank test. P values of less than 
0.05 were considered statistically significant. Statistical analysis was performed using 
StatView 5.0 Software (Abacus Systems, Berkeley, CA, USA). Disease-specific sur- 
vival was used as the primary endpoint. 

Sample extraction. Samples were retrieved, and either had full face sectioning per- 
formed in OCT or the ends excised and processed in formalin to verify the presence 
of carcinoma in the sample to be sequenced and to estimate the percentage of malig- 
nant epithelial nuclei in the sample relative to stromal nuclei. Macrodissection was 
performed if required to excise areas of non-malignant tissue. Nucleic acids were 
then extracted using the Qiagen Allprep Kit in accordance with the manufacturer’s 
instructions with purification of DNA and RNA from the same sample. DNA was 
quantified using Qubit HS DNA Assay (Invitrogen). Throughout the process, all 
samples were tracked using unique identifiers. 

Patient material. One hundred matched normal and tumour derived samples 
were obtained from patients with PDAC. DNA was extracted from the samples 
using the Qiagen Allprep DNA/RNA mini kit method. Tumour cellularity was
determined from SNP array data using qpure. Clinical and sample data are summarized
in Supplementary Table 2. Patients were recruited and consent obtained for
genomic sequencing through the Australian Pancreatic Cancer Genome Initiative (APGI)
as part of the International Cancer Genome Consortium (ICGC).
Patient-derived cell line (PDCL) generation. The PDX-derived primary cell lines,
named The Kinghorn Cancer Centre (TKCC) lines, were generated in the laboratory.
Briefly, patient-derived tumours established
in immunocompromised mice were mechanically and enzymatically dissociated
using collagenase (Stem Cell Technologies, USA) and plated onto flasks coated with
0.2 mg ml⁻¹ rat tail collagen (BD Biosciences, USA). Subsequently, epithelial cultures
were enriched and purified using a FACS Aria III Cell sorter (BD Biosciences, USA),
using a biotinylated anti-mouse MHCI antibody (1:200 dilution; eBiosciences, USA)
coupled with Streptavidin AlexaFluor 647 secondary step (1:1,000; Invitrogen, USA)
and anti-mouse CD140a-PE antibody (1:300; BD Biosciences, USA) to remove mouse
stroma. Dead cells were removed using propidium iodide (Sigma-Aldrich, Australia).
Following establishment, all patient-derived (TKCC) cell lines were profiled by short
tandem repeat (STR) DNA profiling as unique (http://www.cellbankaustralia.com).
Sequencing. DNA (1 μg) was diluted to 52.5 μl in DNase-/RNase-free molecular
biology grade water before fragmentation to approximately 300 bp using the Covaris
S2 sonicator with the following settings: duty cycle 10%, intensity 5, cycles per burst
200, time 50 s, or 45 s for PCR-Free libraries. Following fragmentation, libraries for
sequencing were prepared using the standard Illumina library preparation
technique of end-repair, 3′ end adenylation, indexed adaptor ligation, size selection and
finally PCR enrichment for adaptor-ligated library molecules, following the
manufacturer’s recommendations (Part no. 15026486 Rev. C July 2012). A subset of
libraries was generated omitting the final PCR enrichment step to generate PCR- 
Free libraries as per the manufacturer’s recommendations (Part no. 15036187 Rev. 
A Jan 2013). For standard libraries, the commercially available TruSeq DNA LT Sample
Prep Kit v2 (Catalogue no. FC-121-2001) was used for all steps with the following
exceptions. Size selections of the adaptor-ligated fragments were completed using
two rounds of SPRI bead purifications (AxyPrep Mag PCR Clean-up, Catalog no.
MAG-PCR-CL-250) using a final bead to DNA volume ratio of 0.60:1 followed by
0.70:1, selecting for molecules with an average size of 500 bp. Size-selected libraries 
were then amplified for a total of 8 cycles of PCR to enrich for DNA fragments both 
compatible with sequencing and containing the ligated indexed adaptor. For PCR- 
Free libraries commercially available TruSeq PCR-Free DNA LT Sample Prepara- 
tion Kit (Catalog no. FC-121-3001 and FC-121-3002) was used following the 350 bp 
library LT protocol for all steps with no modifications. The final whole-genome 
libraries were qualified (amplified and PCR-Free libraries) and quantified (amplified 
libraries only) via the Agilent Bioanalyzer 2100 (Catalog ID:G2940CA) instrument
using the DNA High Sensitivity kit (Catalog ID:5067-4626). Quantification of PCR- 
Free libraries was performed using the KAPA Library Quantification Kit for
Illumina sequencing platforms (Kit code KK4824) in combination with the Life
Technologies ViiA 7 real-time PCR instrument.

Whole genome libraries were prepared for cluster generation by cBot (catalogue 
no. SY-301-2002) and sequencing as per the manufacturer’s guidelines. Individual 
libraries were clustered on a single lane of a HiSeq v3 flowcell using the TruSeq PE 
Cluster Kit v3-cBot-HS kit (Catalogue no. PE-401-3001). Illumina supplied con- 
trol library PhiX (10 pM) was spiked into each lane at a concentration of 0.3% to 
provide real time analysis metrics. Final library concentrations of 8 pM (amplified) 
and 14 pM (PCR-free) were used for cluster generation. Clustered flowcells were
sequenced on the Illumina HiSeq 2000 instrument (HiSeq control software v1.5/ 
Real Time Analysis 1.13) using TruSeq SBS Kit v3-HS (200 cycles, Catalog no. FC- 
401-3001). Paired reads each of 101 bp were generated for all libraries and in total 
approximately 220-million paired reads were generated per lane, in line with the 
manufacturer’s specification. Real time analysis of the control library PhiX showed 
cluster density, error rates, quality scores, mapping rates and phasing rates were 
also in line with published specifications. 

Sequence alignment and data management. Sequence data was mapped to a
genome based on the Genome Reference Consortium
(http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/) GRCh37
assembly using BWA. Multiple BAM files from the same sequence library were
merged and within-library duplicates were marked. Resulting final BAMs were
used as input into variant calling. All BAM files have been deposited in the
EGA (Accession number: EGAS00001000154).

Copy number analysis. Matched tumour and normal patient DNA was assayed 
using Illumina SNP BeadChips as per manufacturer’s instructions (Illumina, San 
Diego CA) (HumanOmni1-Quad or HumanOmni2.5-8 BeadChips). SNP arrays
were scanned and data was processed using the Genotyping module (v1.8.4) in
Genomestudio v2010.3 (Illumina, San Diego CA) to calculate B-allele frequencies
(BAF) and logR values. GenoCN and GAP were used to call somatic regions of
copy number change: gain, loss or copy-neutral LOH. Recurrent regions of copy
number change were determined and genes within these regions were extracted 
using ENSEMBL v70 annotations. 

Identification of structural variations. Somatic structural variants were identified 
using the qSV tool (manuscript in preparation). qSV uses independent lines of evi- 
dence to call structural variants including discordant reads, soft clipping and split 
read. Breakpoints are also identified using both de novo assembly of abnormally 
mapping reads and split contig alignment to enhance breakpoint resolution.
Depending on the level of evidence, qSV bins calls into different categories, and calls were
considered high confidence if: (i) they were category 1 and therefore contained
multiple lines of evidence (discordant pairs, soft clipping on both sides and split reads);
(ii) they were category 2 and therefore had 2 lines of evidence: discordant pairs
(both breakpoints) and soft clipping; or discordant pairs (both breakpoints) and split
read; or soft clipping (double sided) and split read; (iii) they were category 3 with 10
or more supporting events (discordant read pairs or soft clipping at both ends).
Only high confidence calls were used in further downstream analysis. Copy number
variation was estimated using SNP arrays and the GAP tool. Depending on the
read pair types supporting an aberration or the associated copy number events,
each structural variant was classified as: deletion, duplication, tandem duplication,
foldback inversion, amplified inversion, inversion, intra-chromosomal rearrangement
or translocation. Essentially, the type of rearrangement is initially inferred from the orientation
information of discordant read pairs, soft clipping clusters and assembled contigs 
which span the breakpoints. This allows identification of 4 groups of events: duplica- 
tions/intra-chromosomal rearrangements, deletions/intra-chromosomal rearrange- 
ments, inversions and inter-chromosomal translocations. Boundaries of segments 
of copy number that occur in close proximity to each breakpoint were then used to 
aid further classification of the events. Structural variants with breakpoints that 
flanked a copy number segment of loss were annotated as deletions. Duplications 
and inversions associated with increases in copy number enabled the character- 
ization of tandem duplications and amplified or foldback inversions. Events within 
the same chromosome which linked the ends of copy number segments of simi- 
lar copy number levels were often identified and were called intra-chromosomal 
rearrangements. 
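
The high-confidence criteria for qSV calls described above can be expressed as a simple filter. The following is a minimal sketch and not the qSV implementation: the `SVCall` record and its field names are hypothetical, and it assumes that the evidence category has already been assigned to each call, so that category 1 and 2 calls pass on category alone and only category 3 calls need a support count.

```python
from dataclasses import dataclass


@dataclass
class SVCall:
    """Hypothetical record for one structural variant call."""
    category: int           # evidence category assigned upstream (1, 2 or 3)
    supporting_events: int  # discordant read pairs or soft-clipping events


def is_high_confidence(call: SVCall, min_cat3_support: int = 10) -> bool:
    """Apply the three rules stated in the text."""
    if call.category in (1, 2):
        # Categories 1 and 2 already imply two or more lines of evidence.
        return True
    if call.category == 3:
        # Category 3 needs 10 or more supporting events.
        return call.supporting_events >= min_cat3_support
    # Any other category is treated as low confidence in this sketch.
    return False


calls = [SVCall(1, 4), SVCall(2, 3), SVCall(3, 12), SVCall(3, 9)]
high_confidence = [c for c in calls if is_high_confidence(c)]
print(len(high_confidence))  # 3
```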

Events were then annotated if they were within 100 kb of a centromere or
telomere, and genes that were affected by breakpoints were annotated using ENSEMBL
v70. Structural variants and copy number data were visualized using Circos.
The landscape of structural rearrangements in pancreatic ductal adenocarci- 
noma. In total 11,868 structural variants were detected within the 100 PDAC cohort 
with an average of 119 events per patient (range 15-558). Each event was classified 
into one of 8 categories: deletion, duplication, tandem duplication, foldback
inversion, amplified inversion, inversion, intra-chromosomal and translocation. Within
the cohort there was inter-patient heterogeneity in terms of total number of events
(range of events per patient 15-558) and proportion of event type (Extended Data 
Fig. 1). 
Classification of subtypes based on the pattern of structural rearrangements.
Each tumour was classified into one of four subtypes based on the volume of events, 
the predominance of specific types of structural rearrangement events and the dis- 
tribution of events across the genome in each patient. In addition to counting struc- 
tural variation events, two analyses were carried out to detect localized events. 
Non-random chromosomal clustering of structural variants was detected using an
approach originally described by Korbel and Campbell. Significant clustering of
structural variation events was determined by a goodness-of-fit test against the
expected exponential distribution (with a significance threshold of <0.0001).
Highly focal events were detected using an adaptation of a method in which
chromosomes were flagged when their structural variant mutation rate per Mb
exceeded 5 times the length of the interquartile range from the 75th percentile
of the chromosome counts for each patient. The rules used to determine these
subtypes are as follows:

Stable. These tumours contain few structural rearrangements (<50) which are
located randomly through the genome. 

Locally rearranged. The intra-chromosomal rearrangements in these tumours
are not randomly positioned through the genome, instead they are clustered on one 
or few chromosomes. To correct for the different chromosome lengths, the number 
of events per Mb was calculated for each chromosome within each tumour. Tumours 
were considered locally rearranged if they harboured at least 50 somatic events 
within the genome and contained a locally rearranged chromosome. Chromosomes 
were considered locally rearranged if the number of intrachromosomal events 
exceeded 5 times the length of the interquartile range from the 75th percentile of 
the chromosome counts per Mb for that patient. The events in the locally rearranged 
tumours are broadly comprised of either: (1) focal amplifications—the majority of 
events are gain (tandem duplication, duplication, foldback inversion or amplified 
inversion) or (2) complex rearrangements—the events are part of a complex event 
such as chromothripsis or breakage-fusion-bridge. 

Scattered. These tumours contain 50-200 structural rearrangements which are
scattered throughout the genome. 

Unstable. These tumours are massively rearranged as they contain >200
structural rearrangements which are generally scattered throughout the genome.
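
The rules above can be sketched as a small decision procedure. This is an illustrative reading of the stated thresholds rather than the authors' code. The function names and the input layout (one intra-chromosomal events-per-Mb value per chromosome) are hypothetical; the handling of boundary counts (exactly 50 or 200 events) and the choice to test for a locally rearranged chromosome before the >200 rule are assumptions; and the quartile convention of Python's `statistics.quantiles` may differ from the one used in the study.

```python
from statistics import quantiles


def has_locally_rearranged_chromosome(events_per_mb, k=5.0):
    """True if any chromosome exceeds Q3 + k * IQR of the per-Mb rates."""
    q1, _, q3 = quantiles(events_per_mb, n=4)
    threshold = q3 + k * (q3 - q1)
    return any(rate > threshold for rate in events_per_mb)


def classify_subtype(total_events, events_per_mb):
    """Assign one of the four structural subtypes to a tumour."""
    if total_events < 50:
        return "stable"
    if has_locally_rearranged_chromosome(events_per_mb):
        return "locally rearranged"
    if total_events > 200:
        return "unstable"
    return "scattered"


# Toy per-chromosome rates (24 chromosomes); values are made up.
even_rates = [0.1, 0.2, 0.3, 0.4] * 6
focal_rates = even_rates[:-1] + [5.0]  # one chromosome with a focal event

print(classify_subtype(30, even_rates))    # stable
print(classify_subtype(120, even_rates))   # scattered
print(classify_subtype(120, focal_rates))  # locally rearranged
print(classify_subtype(300, even_rates))   # unstable
```

Normalising by chromosome length before applying the outlier rule, as the text describes, keeps long chromosomes from being flagged simply because they carry more events.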
Classification of complex localized events. Evidence of clustering of breakpoints 
was estimated as proposed by Korbel and Campbell. Chromosomes with
clustering of structural variants were reviewed for evidence of chromothripsis (oscillation
of copy number, random joins and retention of heterozygosity) and
breakage-fusion-bridge (BFB; loss of telomeric region with neighbouring highly
amplified region with inversions).
Verification of structural variations. We used two methods of verification for
structural variants: (1) an in silico approach, which considers events with multiple 
lines of evidence (qSV category 1: discordant pairs, soft clipping on both sides and 
split read evidence) as verified, as well as events which were associated with a copy 
number change (gain or loss) and (2) orthogonal sequencing methods including 
SOLiD long mate pair and capillary sequencing. 

Long mate pair sequencing and verification of structural rearrangements. Long 
mate-paired libraries were made according to Applied Biosystems Mate-Paired 
Library Preparation 5500 Series SOLiD systems protocol using 5 μg of DNA, which
was sheared using the Covaris S220 System. Long mate pair libraries were sequenced
using the SOLiD v4 (Applied Biosystems). Sequence data was mapped to a genome
based on the Genome Reference Consortium
(http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/human/) GRCh37
assembly using bioscope v1.2.1 (Applied Biosystems). Each sample was sequenced
to an average non-redundant physical coverage of 180 (64-333) in the tumour
and 187 (52-503) in the control sample.
Structural rearrangements were determined by analysing clusters of discordant read
pairs using the qSV tool. Events identified by HiSeq sequencing were considered
verified if the right and left breakpoint of these events were within 500 bases of the 
right and left breakpoint of an event identified by SOLiD sequencing. 
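
The breakpoint-matching rule above can be illustrated with a minimal sketch. The `Event` record, the function name and the requirement that both breakpoints lie on the same chromosomes are assumptions made for illustration; they are not taken from qSV, and "within 500 bases" is read here as an inclusive distance.

```python
from typing import NamedTuple, Sequence


class Event(NamedTuple):
    """Hypothetical minimal record for a structural rearrangement."""
    chrom_left: str
    pos_left: int
    chrom_right: str
    pos_right: int


def is_verified(hiseq: Event, solid_events: Sequence[Event], window: int = 500) -> bool:
    """Return True if any SOLiD event has both breakpoints within `window`
    bases of the corresponding HiSeq breakpoints (assumed inclusive)."""
    for solid in solid_events:
        same_chroms = (hiseq.chrom_left == solid.chrom_left
                       and hiseq.chrom_right == solid.chrom_right)
        if (same_chroms
                and abs(hiseq.pos_left - solid.pos_left) <= window
                and abs(hiseq.pos_right - solid.pos_right) <= window):
            return True
    return False
```

Both breakpoints must match the same SOLiD event; matching the left breakpoint to one event and the right breakpoint to another does not count in this sketch.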

PCR and capillary sequencing for verification of structural rearrangements. 
For PCR and capillary sequencing, PCR primers were designed with Primer-BLAST 
(NCBI) to span the predicted breakpoint. PCR was carried out in the tumour and 
matched normal genomic DNA using, respectively, a 25 or 50 µl reaction volume 
composed of 22 or 44 µl of Platinum Taq DNA polymerase (Invitrogen, Carlsbad, CA), 
2 or 4 µl of 10 µM primer (Integrated DNA Technology) and 1 or 2 µl of genomic DNA 
as template (1 ng µl⁻¹). The following parameters were used for the PCR: initial 
denaturation at 94 °C for 2 min, followed by 35 cycles of denaturation at 94 °C 
for 30 s, annealing at 60 °C for 30 s and extension at 68 °C for 1 min, followed 
by final extension at 68 °C for 15 min. PCR products were visualized by gel 
electrophoresis and classified into one of four categories: (1) validated: a strong 
and specific PCR band of the expected size was observed only in the tumour and not 
in the normal sample, which indicates a somatic rearrangement; (2) germline: a clear 
PCR band of the expected size was observed in both the tumour and normal; (3) not 
validated: PCR yielded smears or multiple bands, which potentially indicates a 
non-specific primer pair; (4) not tested: no PCR band was observed in tumour and 
normal. 

Verification of structural variations—results. In total 7,105 events were verified 
in silico. Of these 5,666 events contained multiple lines of evidence (qSV category 1), 
2,904 events were associated with a copy number change (events classified as dele- 
tion, duplication, tandem duplication, amplified inversion and foldback inversion) 
and 1,871 contained multiple lines of evidence and were associated with a copy 
number change. 

We also verified structural variant events using long mate pair resequencing 
(SOLiD paired 50 bp) or sequencing of a different sample from the same patient for 
33 tumours. Using this approach 1,924 events were confirmed, and the verification 
status of structural variant events was recorded in Supplementary Table 5 in the 
“Validation_status_id” column, where 0 = untested and 1 = verified. In total 7,228 
of the 11,868 events identified (61%) were verified (Supplementary Table 5 and 
Extended Data Fig. 1); the remaining events remain untested. 

Identification of substitutions and small insertion/deletions. Substitutions were 
called using two variant callers: qSNP’, an in-house heuristics-driven somatic/germline 
caller; and GATK*, which is a Bayesian caller. The two callers were chosen because 
they use very different calling strategies and, while each may be subject to artefacts 
(as are all variant callers), they will be subject to different artefacts. Each compared 
variant falls into one of three categories: seen only by qSNP, seen only by GATK, 
and seen by both qSNP and GATK. Mutations identified by both callers, or those 
that were unique to a caller and verified by an orthogonal sequencing approach, were 
considered high confidence and used in all subsequent analyses (Supplementary 
Table 3). Small indels (<200 bp) were identified using Pindel*; each indel was visually 
inspected in the Integrative Genomics Viewer (IGV)**. Once somatic mutations 
were called, their effects on any alternative transcripts were annotated using a local 
install of the Ensembl database (v70) and the Ensembl Perl API. 
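
The two-caller consensus rule above amounts to simple set operations. The following is a minimal sketch, assuming each variant is represented as a hashable key such as `(chromosome, position, ref, alt)`; the function names are hypothetical and do not correspond to qSNP or GATK interfaces.

```python
def categorize_calls(qsnp_calls: set, gatk_calls: set) -> dict:
    """Split variants into the three categories described in the text."""
    return {
        "both": qsnp_calls & gatk_calls,
        "qsnp_only": qsnp_calls - gatk_calls,
        "gatk_only": gatk_calls - qsnp_calls,
    }


def high_confidence(qsnp_calls: set, gatk_calls: set, orthogonally_verified: set) -> set:
    """Variants seen by both callers, plus single-caller variants that were
    verified by an orthogonal sequencing approach."""
    cats = categorize_calls(qsnp_calls, gatk_calls)
    single_caller = cats["qsnp_only"] | cats["gatk_only"]
    return cats["both"] | (single_caller & orthogonally_verified)
```

A variant reported by only one caller and lacking orthogonal support is excluded from the high-confidence set in this sketch.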

Verification of substitutions and small insertion/deletions. In total 3,304 of the 
10,335 events identified were verified (Supplementary Tables 3 and 12); the remaining 
events remain untested. Substitutions and indels were verified using orthogonal 
sequence data, which included data produced on different sequencing platforms 
(HiSeq or SOLiD exome or long mate pair SOLiD sequencing) or data from related 
nucleotide samples (RNA-seq). For example, if orthogonal tumour sequence data 
were available (DNA from a cell line, RNA from the primary sample etc.) and a somatic 
variant was also observed in the second tumour sample, then that would add support 
for the variant. It should be noted that tumour samples can only be used to support 
an existing somatic variant, and the absence of a called variant in a second tumour 
sample does not discredit the original call. Conversely, a second normal sample will 
only discredit somatic variants, and the absence of the called variant in the second 
normal does not support the original call. This approach is designed to be conservative. 
In order to be considered for verification, an additional BAM should have a 
minimum of 10 reads at the variant position, and at least two reads must show the 
variant. If multiple additional BAMs are available, each BAM votes independently 
and the concordance of the votes is used to classify the verification of the variant. 
Each variant examined by qVerify is assigned to one of four categories: 

(1) Verified: one or more additional tumour BAMs showed evidence of the 
variant and no additional normal BAMs showed the variant. 

(2) False Positive: one or more additional normal BAMs showed evidence of 
the variant, indicating that it is likely to be a germline variant. 

(3) Mixed: across multiple additional BAMs, there was conflicting evidence; 
one or more additional tumour BAMs showed the variant, as did one or more 
additional normal BAMs. This could also be evidence of a germline variant incorrectly 
called somatic. 

(4) Untested: there were no additional BAMs, or there were additional BAMs 
but none passed the minimum coverage threshold, or there were additional BAMs 
that did not show the variant and so did not provide evidence for or against it. 
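
The voting scheme and the four categories above can be sketched as follows. This is an illustrative reading of the rules as stated, not the qVerify implementation; each additional BAM is reduced to a hypothetical `(depth, variant_reads)` pair at the variant position, and all names are assumptions.

```python
from typing import List, Optional, Sequence, Tuple

MIN_DEPTH = 10          # minimum reads at the variant position
MIN_VARIANT_READS = 2   # minimum reads showing the variant

Pileup = Tuple[int, int]  # (depth, variant_reads) for one additional BAM


def bam_vote(depth: int, variant_reads: int) -> Optional[bool]:
    """None if the BAM fails the coverage threshold, otherwise whether it shows the variant."""
    if depth < MIN_DEPTH:
        return None
    return variant_reads >= MIN_VARIANT_READS


def classify(tumour_bams: Sequence[Pileup], normal_bams: Sequence[Pileup]) -> str:
    """Combine independent per-BAM votes into one of the four categories."""
    def votes(bams: Sequence[Pileup]) -> List[bool]:
        return [v for v in (bam_vote(d, r) for d, r in bams) if v is not None]

    tumour_support = any(votes(tumour_bams))
    normal_support = any(votes(normal_bams))
    if tumour_support and normal_support:
        return "Mixed"
    if normal_support:
        return "False Positive"
    if tumour_support:
        return "Verified"
    # No usable BAMs, or usable BAMs that did not show the variant.
    return "Untested"
```

Note the asymmetry carried over from the text: a tumour BAM lacking the variant never discredits a call, so it contributes nothing rather than a negative vote.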

Telomere length analysis. Reads containing the telomeric repeat (TTAGGG)×3 
or (CCCTAA)×3 were counted and normalized to the average genomic coverage 
(the average base coverage of each genome). The normalized telomere count was 
obtained separately for each tumour and its matching normal. A ratio was calculated 
as tumour normalized counts/normal normalized counts. 
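
A minimal sketch of this calculation is given below. It assumes reads are available as upper-case strings and that "×3" means three consecutive copies of the hexamer; the function names are hypothetical.

```python
from typing import Iterable

# Three consecutive copies of the telomeric hexamer on either strand (assumed reading).
TELOMERE_MOTIFS = ("TTAGGG" * 3, "CCCTAA" * 3)


def count_telomeric_reads(reads: Iterable[str]) -> int:
    """Count reads containing either telomeric motif."""
    return sum(1 for read in reads if any(motif in read for motif in TELOMERE_MOTIFS))


def normalized_telomere_count(reads: Iterable[str], mean_coverage: float) -> float:
    """Telomeric read count divided by the average base coverage of the genome."""
    return count_telomeric_reads(reads) / mean_coverage


def telomere_ratio(tumour_reads: Iterable[str], tumour_coverage: float,
                   normal_reads: Iterable[str], normal_coverage: float) -> float:
    """Tumour normalized count divided by normal normalized count."""
    return (normalized_telomere_count(tumour_reads, tumour_coverage)
            / normalized_telomere_count(normal_reads, normal_coverage))
```

A ratio below 1 would indicate fewer telomeric reads in the tumour than in its matched normal after coverage normalization; the sketch does not handle a normal sample with zero telomeric reads.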

Determination of the BRCA signature. High confidence somatic mutations that 
were called by both qSNP and GATK across the genome were used to determine 
the proportion of the BRCA signature in each sample using a published computa- 
tional framework”. In this way, the 96 substitution classification (as determined 
by substitution class and sequence context) was determined for each sample and 
compared to the validated BRCA signature” and the proportion of the BRCA 
signature in a given sample was ascertained. 
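
The 96-class substitution classification can be sketched as below. The sketch assumes the widely used convention of reporting each substitution relative to the pyrimidine (C or T) of the mutated base pair, giving 6 substitution types × 4 upstream bases × 4 downstream bases = 96 classes; the source does not spell out this convention, and the function names are hypothetical. Fitting the resulting profile to the BRCA signature is not shown.

```python
from collections import Counter
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def substitution_class(context: str, alt: str) -> str:
    """Classify a substitution from its trinucleotide context (mutated base in
    the middle) and alternative base, e.g. ("TCA", "T") -> "T[C>T]A"."""
    ref = context[1]
    if ref in "AG":  # purine reference: report on the opposite strand
        context = reverse_complement(context)
        alt = COMPLEMENT[alt]
    return f"{context[0]}[{context[1]}>{alt}]{context[2]}"


def mutation_profile(mutations) -> dict:
    """Proportion of mutations in each observed class for one sample."""
    counts = Counter(substitution_class(ctx, alt) for ctx, alt in mutations)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cls: n / total for cls, n in counts.items()}


# Enumerate every class reachable from the 64 trinucleotides and 3 alternative bases.
ALL_CLASSES = sorted({
    substitution_class("".join(ctx), alt)
    for ctx in product("ACGT", repeat=3)
    for alt in "ACGT" if alt != ctx[1]
})
```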

Patient derived xenograft (PDX) mouse model generation. Six female eight- 
week-old NOD/SCID/interleukin 2 receptor [IL2R] gamma (null) (NOG) mice and 
athymic Balb-c-nude mice were used for the establishment of the patient derived 
xenograft (PDX) model. All mice were bred at the Australian Bioresources (ABR) 
under research protocols approved by the Garvan Animal Ethics Committee 
(09/19, 11/23, 11/09). 

The PDXs were generated according to methodology published elsewhere with 
modifications”. Briefly, surgical non-diagnostic specimens of patients operated 
at APGI clinical sites were implanted subcutaneously (s.c.) into three NOG and 
three Balb-c-nude mice for each patient, with two small pieces per mouse (left and 
right flank; engraftment stage). Once established, tumours were grown to a size of 
1,500 mm³, at which point they were harvested, divided, and re-transplanted into 
further mice to bank sufficient tissues for experimentation (first passage and second 
passage). After expansion, passaged tumours were excised and propagated to cohorts 
of 40 or more female Balb-c-nude mice at an average age of 8 weeks, which constituted 
the treatment cohort (third passage). Use of the NOG mouse model, which is 
characterized by high immune deficiency, enabled the establishment in this study of 
a significant cohort of PDXs (80 xenografts), with a high rate of successful 
engraftment and propagation (76%, data not shown). 

In vivo therapeutic testing. Tumour-bearing mice with a palpable tumour (volume 
(V) = 150 mm³; V = 0.5 × length × width²) were treated with various agents at 
maximum tolerable dose (MTD) or vehicle treatment based on previously established 
schedules*”*’, where gemcitabine (140 mg per kg) was administered intraperitoneally 
on day 1 and day 4 for 4 weeks and cisplatin (6 mg per kg) intravenously 
on day 1 and day 14. The investigators were not blinded to the group allocation. To 
avoid accumulating toxicity of repeated injections, an additional treatment was 
given after the recovery time of two weeks only when no tumour regression was 
observed; otherwise treatment was continued once the tumour relapsed to its original 
size (100%). Measurement of chemotherapy response was based on published 
methodology”', where primary xenografts were treated with the specified monotherapy 
and their growth characteristics mapped from the time resistance developed 
(characterized by progressive tumour growth in the presence of drug), until euthanasia. 
Mice were euthanized and tissues collected for further analyses when tumour 
size reached 400% (600-700 mm³). 
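
As a worked illustration of the volume formula and the size thresholds above, the following sketch computes tumour volume from caliper measurements and expresses it relative to the starting volume. The function names and example measurements are hypothetical.

```python
def tumour_volume(length_mm: float, width_mm: float) -> float:
    """V = 0.5 x length x width^2, in mm^3."""
    return 0.5 * length_mm * width_mm ** 2


def relative_size_percent(volume_mm3: float, baseline_mm3: float) -> float:
    """Tumour size as a percentage of its size at the start of treatment."""
    return 100.0 * volume_mm3 / baseline_mm3


def reached_endpoint(volume_mm3: float, baseline_mm3: float, endpoint_percent: float = 400.0) -> bool:
    """True once the tumour has reached the endpoint size (400% in the text)."""
    return relative_size_percent(volume_mm3, baseline_mm3) >= endpoint_percent
```

For example, a 12 mm × 5 mm tumour has a volume of 0.5 × 12 × 5² = 150 mm³, and 400% of that starting volume is 600 mm³, which matches the lower end of the stated endpoint range.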

RAD51 foci formation assay. Antibodies used included RAD51 (clone 14B4, 
GeneTex), γH2AX (phospho-histone H2AX Ser139, clone 20E3, Cell Signaling), 
and geminin (10802-1-AP, ProteinTech Group, Chicago, IL). A primary culture of the 
PDX from patient ICGC_0016 was established by plating and growing cells from 
an enzymatically digested xenograft on a collagen matrix for approximately 1 week 
before irradiation and immunofluorescence staining. For this experiment, the xenograft 
was established in an NSG-eGFP mouse. This mouse model allowed us to efficiently 
visualize eGFP-positive mouse stromal cells and eGFP-negative tumour cells under 
the microscope. Briefly, the eGFP-expressing NSG mouse was generated in our 
laboratory by crossing previously established heterozygous eGFP NOD.CB17-Prkdcscid 
mice** with the NOD/SCID/interleukin 2 receptor (IL2R) gamma 
(null) (NOG) strain. eGFP-expressing offspring were backcrossed 
five times onto the parental line to ensure homozygosity for IL2Rgamma deletion, 
which was confirmed by genotyping (Transnetyx). 

Cell lines of interest were grown on coverslips overnight and irradiated with 
10 Gy or left untreated. Subsequently, coverslips were fixed with 4% paraformaldehyde 
(in PBS) 6 h post-irradiation and stained with RAD51, γH2AX and geminin 
antibodies as previously described’’. DAPI was used as a nuclear stain. RAD51 focus 
assay scoring was performed as previously established*’. 

Extended Data Figure 1 | Summary of structural rearrangements. 
a, Histogram showing the number of events verified in silico or by orthogonal 
sequencing methods (Methods). In total 7,228 of the 11,868 events identified 
(61%) were verified; the others remain untested. These included 5,666 
events which contained multiple lines of evidence (qSV category 1: discordant 
pairs, soft clipping on both sides and split read evidence, Methods) and thus were 
considered verified. Of these, 2,463 events were also verified by 
orthogonal sequencing methods (SOLiD long mate pair or PCR amplicon 
sequencing) or the event was associated with a copy number change which was 
determined using SNP arrays. The remaining 1,562 events were verified using 
orthogonal sequencing methods or the event was associated with a copy 
number change (qSV category 2 and 3, Methods). b, Histogram showing the 
number of structural rearrangements in each pancreatic cancer. 100 PDACs 
were sequenced using HiSeq paired-end whole-genome sequencing. Structural 
rearrangements were identified and classified into 8 categories (deletions, 
duplications, tandem duplications, foldback inversions, amplified inversions, 
inversions, intra-chromosomal and inter-chromosomal translocations, 
Methods). The number and type of event for each patient is shown. PDAC 
shows a high degree of heterogeneity in both the number and types of events per 
patient. The structural rearrangements were used to classify the tumours into 
four categories (stable, locally rearranged, scattered and unstable, Methods). 


©2015 Macmillan Publishers Limited. All rights reserved 


[Figure: upper panel, total number of structural rearrangements (y axis) per patient; lower panel, chromosomes 1-22 and X (y axis) per patient. Patients (x axis) are grouped as stable, locally rearranged, scattered and unstable.] 


Extended Data Figure 2 | Distribution of structural variant breakpoints 
within each patient. The 100 patients are plotted along the x axis. The upper 
plot shows the number of structural rearrangements (y axis) in each tumour. 
The lower plot shows which chromosomes (y axis) harbour clusters of 
breakpoints. The distribution of breakpoints (events per Mb) within each 
chromosome for each sample was evaluated using two methods to identify 
clusters of rearrangements or chromosomes which contain a large number of 
events. Method 1: chromosomes with a significant cluster of events were 
determined by a goodness-of-fit test against the expected exponential 
distribution (with a significance threshold of <0.0001). Chromosomes which 
pass these criteria are coloured blue. Method 2: chromosomes were identified 
which contain significantly more events per Mb than other chromosomes for 
that patient. Chromosomes were deemed to harbour a high number of events if 
they had a mutation rate per Mb which exceeds 1.5 times the length of the 
interquartile range from the 75th percentile of the chromosome counts for each 
patient. Chromosomes which pass these criteria are coloured orange. 
Chromosomes which pass both tests are coloured red. These criteria show 
that the unstable tumours which contain many events often have significant 
clusters of events. In contrast locally rearranged tumours are associated with 
both clusters of events and a high number of events within that chromosome 
when compared to other chromosomes. 
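
Method 2 in the legend above is an interquartile-range outlier rule applied within each patient. A minimal sketch follows, assuming per-chromosome event rates (events per Mb) are already computed; the quartile interpolation method is an assumption, since the source does not specify one, and the function name is hypothetical. The goodness-of-fit test of Method 1 is not shown.

```python
import statistics
from typing import Dict, List


def high_event_chromosomes(events_per_mb: Dict[str, float]) -> List[str]:
    """Return chromosomes whose rate exceeds Q3 + 1.5 x IQR for this patient."""
    q1, _median, q3 = statistics.quantiles(
        events_per_mb.values(), n=4, method="inclusive"
    )
    threshold = q3 + 1.5 * (q3 - q1)
    return sorted(chrom for chrom, rate in events_per_mb.items() if rate > threshold)
```

Because the threshold is derived from each patient's own chromosomes, a tumour with uniformly many events yields no flagged chromosomes, whereas a single heavily rearranged chromosome in an otherwise quiet genome stands out.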

Extended Data Figure 3 | The stable subtype in pancreatic ductal 
adenocarcinoma. The 20 stable tumours are shown using circos. The coloured 
outer ring represents the chromosomes, the next ring depicts copy number 
(red represents gain and green represents loss), the next is the B allele frequency. 
The inner lines represent chromosome structural rearrangements detected by 
whole genome paired sequencing and the legend indicates the type of 
rearrangement. Stable tumours contained fewer than 50 structural 
rearrangements in each tumour. 

Extended Data Figure 4 | The locally rearranged subtype in pancreatic 
ductal adenocarcinoma. The 30 locally rearranged tumours are shown using 
circos. The coloured outer rings represent the chromosomes, the next ring 
depicts copy number (red represents gain and green represents loss), the next is 
the B allele frequency. The inner lines represent chromosome structural 
rearrangements detected by whole-genome paired sequencing and the legend 
indicates the type of rearrangement. In the locally rearranged subtype over 25% 
of the structural rearrangements are clustered on one or a few chromosomes. 

pancreatic ductal adenocarcinoma (ICGC_0109). Upper plot is a density plot 
showing a concentration of break-points on chromosome 5. Next panel shows 
the structural rearrangements which are coloured as presented in the legend. 
The lower panels show copy number, logR ratio and B allele frequency derived 
… events similar to chromothripsis. Copy number profile and structural 
rearrangements suggest a shattering of chromosome 5 with a high 
concentration of structural rearrangements, switches in copy number state and 
retention of heterozygosity, which are characteristics of a chromothriptic event. 

Extended Data Figure 6 | Example of evidence for breakage-fusion-bridge 
(BFB) in a pancreatic ductal adenocarcinoma (ICGC_0042). Upper plot is a 
density plot showing a concentration of break-points on chromosome 5. 
Next panel shows the structural rearrangements which are coloured as 
presented in the legend. The lower panels show copy number, logR ratio and B 
allele frequency derived from SNP arrays. This chromosome showed a complex 
localization of events similar to BFB. Copy number profile suggests loss of 
telomeric q arm and a high concentration of structural rearrangements 
suggesting a series of BFB cycles, with multiple inversions mapped to the 
amplified regions. 


[Figure legend: intra-chromosomal rearrangement, inter-chromosomal translocation, duplication, tandem duplication, inversion, foldback inversion, amplified inversion, deletion.] 

Extended Data Figure 7 | The scattered subtype in pancreatic ductal 
adenocarcinoma. The 36 tumours classified as scattered are shown using 
circos. The coloured outer rings represent the chromosomes, the next ring 
depicts copy number (red represents gain and green represents loss), the next 
shows the B allele frequency. The inner lines represent chromosome structural 
rearrangements detected by whole genome paired end sequencing. The 
legend indicates the type of rearrangement. The scattered tumours contained 
50-200 structural rearrangements in each tumour. 

Extended Data Figure 8 | The unstable subtype in pancreatic ductal 
adenocarcinoma. The 14 unstable tumours are shown using circos. The 
coloured outer rings are chromosomes, the next ring depicts copy number 
(red represents gain and green represents loss), the next is the B allele frequency. 
The inner lines represent chromosome structural rearrangements detected by 
whole genome paired sequencing and the legend indicates the type of 
rearrangement. The unstable tumours contained a large degree of genomic 
instability and harboured over 200 structural rearrangements in each tumour, 
which were predominantly intra-chromosomal rearrangements evenly 
distributed through the genome. 

untreated cells derived from an unstable pancreatic tumour with a somatic damage. TKCC-07 is a pancreas cancer cell line generated from a homologous 

Lagging-strand replication shapes the 
mutational landscape of the genome 


Martin A. M. Reijns¹*, Harriet Kemp²*, James Ding¹, Sophie Marion de Procé², Andrew P. Jackson¹ & Martin S. Taylor² 


The origin of mutations is central to understanding evolution and of key relevance to health. Variation occurs 
non-randomly across the genome, and mechanisms for this remain to be defined. Here we report that the 5’ ends of 
Okazaki fragments have significantly increased levels of nucleotide substitution, indicating a replicative origin for such 
mutations. Using a novel method, emRiboSeq, we map the genome-wide contribution of polymerases, and show that 
despite Okazaki fragment processing, DNA synthesized by error-prone polymerase-α (Pol-α) is retained in vivo, comprising 
approximately 1.5% of the mature genome. We propose that DNA-binding proteins that rapidly re-associate 
post-replication act as partial barriers to Pol-δ-mediated displacement of Pol-α-synthesized DNA, resulting in 
incorporation of such Pol-α tracts and increased mutation rates at specific sites. We observe a mutational cost to 
chromatin and regulatory protein binding, resulting in mutation hotspots at regulatory elements, with signatures of 
this process detectable in both yeast and humans. 


Mutations occur despite the exquisite fidelity of DNA replication, effi- 
cient proofreading and mismatch repair’, resulting in heritable dis- 
ease and providing the raw material for evolution. Genome variation 
is non-uniform’, the outcome of diverse mutational processes’, repair 
mechanisms‘ and selection pressures”®. This variability is exemplified 
by nucleotide substitution rates around nucleosome binding sites, with 
the highest rates at the nucleosome midpoint (dyad position)”. 

Bidirectional replication of genomic DNA necessitates discontinu- 
ous synthesis of the lagging strand as a series of Okazaki fragments 
(OFs)'*"*, which then undergo processing to form an intact continu- 
ous DNA strand’*"*. Recently, the genomic locations at which OFs are 
ligated (Okazaki junctions, OJs) were mapped”. In this experimental 
system, OJs occurred at an average rate of 0.6% per nucleotide; how- 
ever, frequency was strongly influenced by the binding of nucleosomes 
and transcription factors (TFs). These proteins act as partial blocks to 
Pol-δ processivity, resulting in the accumulation of OJs at their binding 
sites. Here, we demonstrate the mutational consequences of such pro- 
tein binding. 


Substitutions correlate with OJs 


We were struck by the similarity of the distribution of Saccharomyces 
cerevisiae OJ sites at nucleosomes” to that previously reported for 
nucleotide substitutions”*’”"”’, and set out to investigate the potential 
reasons for this. We established that nucleotide substitution and OJ distributions 
are highly correlated (Pearson’s correlation coefficient = 0.76, 
P=2.2X10 '°) and essentially identical in pattern (Fig. 1a). Furthermore, 
differences in OJ distribution by nucleosome type (genic versus 
non-genic), spacing or consistency of binding were mirrored by the substitution 
rate distribution (Extended Data Fig. 1a-f). We found similar 
strong correlation in the regions directly surrounding TF binding sites 
of Reb1 (Fig. 1b; Pearson’s correlation = 0.57, P= 5.6 X 107°) and 
Rap1 (Extended Data Fig. 1g), providing further evidence for a direct 
association. At the sequence-specific binding sites themselves, substitution 
rates were depressed relative to the OJ, resulting from strong 
selection pressure to maintain TF binding, and obscuring any mutational 
signal at these nucleotides. 
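
The correlation reported above compares two per-position profiles across a binding site: OJ frequency and substitution rate. A minimal sketch of Pearson's correlation coefficient for two such profiles is shown here; the function name is hypothetical, no real data are reproduced, and the P value calculation is not shown.

```python
import math
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson's r between two equal-length profiles (e.g. OJ frequency and
    substitution rate at each position relative to a nucleosome dyad)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)
```

A value near 1 indicates that positions with more frequent OJs also carry more substitutions, which is the pattern described for nucleosome and Reb1 binding sites.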

Given that both classes of sites (nucleosomes and TFs) are present 
genome-wide and represent different biological processes, this associ- 
ation was probably the direct consequence of protein binding at these 
sites. However, to rule out site-specific biases in sequence as a con- 
founding explanation for the observed distributions, we randomly sam- 
pled the rest of the genome for trinucleotides of identical sequence 
compositions and calculated the substitution rate at these sites, on a 
nucleotide-by-nucleotide position basis (Extended Data Fig. 1h-j). This 
resulted in loss of the observed patterns, establishing that nucleotide 
composition bias was not a contributing factor. Furthermore, the ob- 
served association was not restricted to polymorphism rates, as yeast 
inter-species nucleotide substitution patterns at both nucleosome and 
Reb1 TF binding sites were identical (Extended Data Fig. 1k, l). 

Figure 1 | Increased substitution rates at OJs. a, b, Nucleotide (nt) 
substitution rates (red) closely correlate with increased OJ site frequency (blue) 
at nucleosome (a) and Reb1 (b) binding sites. S. cerevisiae polymorphism rates 
per nucleotide computed using sequences from nucleosome (n = 27,586) 
and Reb1 binding sites (n = 881). Individual data points, open circles. Solid 
curves, best-fit splines. Mean, dashed grey line; ±10%, dotted grey lines. 


¹Medical and Developmental Genetics, MRC Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh EH4 2XU, UK. ²Biomedical Systems Analysis, MRC 
Human Genetics Unit, MRC Institute of Genetics and Molecular Medicine, University of Edinburgh, Edinburgh EH4 2XU, UK. 


*These authors contributed equally to this work. 


502 | NATURE | VOL 518 | 26 FEBRUARY 2015 




We therefore concluded that OJ frequency and nucleotide substitu- 
tion rates could be causally related, and set out to investigate the po- 
tential mechanism for this association. 


Mutations at 5’ ends of OFs 


The synthesis and processing of OFs is directional. Therefore, substitu- 
tion rates would be expected to be asymmetrical relative to the direction 
of synthesis, if a component of this process was the cause. As most of the 
genome is preferentially replicated with either the forward or reverse 
strand as the lagging strand, we orientated regions by their dominant 
direction of lagging-strand synthesis. This revealed substantially in- 
creased nucleotide substitution rates immediately downstream of OJs 
(Fig. 2a), the level of mutational signal correlating with OJ site frequency. 
Quantification of substitution rates for the five nucleotides immediately 
upstream and downstream of the OJ (Fig. 2b) demonstrated that high 
frequency OJ sites (11-fold increased OJ rate relative to baseline; top 
99.9th centile of sites) displayed the highest substitution rate (P < 2.2 
xX 107°), with significant increases (P< 2.2 X 10~ 1) for medium fre- 
quency sites, (6.1-fold, 99-99.9th centile) but not low frequency sites 
(P = 0.3, 1.7-fold, OJ sites <99th centile). This was not due to site- 
specific sequence biases, as the increase in substitution rate was lost 
after a trinucleotide preserving genome shuffle. Therefore, point muta- 
tions are enriched at the 5’ ends of mature OFs of frequently occurring 
OJ sites, sites that correspond to protein barriers to Pol-δ processivity””. 


Pol-a DNA retention hypothesis 


We next considered which aspect of lagging-strand synthesis might be 
responsible. OFs are generated by the consecutive actions of Pol-α and 
Pol-δ (Fig. 2c). When the previously synthesized, downstream OF is 
encountered, OF processing occurs’, involving the coordinated action 
of FEN1 and DNA2 nucleases’*”* in conjunction with continuing DNA 
synthesis by Pol-δ, before final ligation of adjoining DNA fragments. 
During this process, most if not all of the 10-30-nucleotide-long DNA 
primer synthesized by Pol-α’?”° has been thought to be removed alongside 
the RNA primer, and replaced by Pol-δ-synthesized DNA'®*!. 
This would be desirable, as unlike other replicative DNA polymerases, 
Pol-α lacks 3’-to-5’ proofreading exonuclease activity, limiting its intrinsic 
fidelity**. On the other hand, studies on the mutagenesis pattern 
of reduced fidelity polymerase mutants in yeast demonstrate that 
Pol-α-synthesized DNA does contribute to the genome”’****. How comprehensive 
the removal or retention of such DNA is in vivo is unknown, 
but notably the retention of error-prone Pol-α-synthesized DNA at 
the 5’ end of OFs would provide a straightforward explanation for the 
increased mutation rates we observed. Given that protein barriers have 
been shown to influence OF processing", we therefore propose that 
Pol-α-synthesized DNA is preferentially retained at sites where proteins 
bind shortly after initial OF DNA synthesis (Fig. 2c). Our model would 
predict (1) that Pol-α tracts are retained at a considerable level within 
the mature genome post-replication, and (2) that mutational signatures 
arising from such Pol-α-synthesized DNA will be increased at many 
DNA-binding protein sites in eukaryotes. 

Figure 2 | Frequent nucleotide substitutions at OF 5’ ends. a, Mutation rates 
are increased downstream of OJs. Substitution polymorphisms (red) and OJ 
rate (blue) in regions surrounding high frequency OJs (top 0.1%). n = 5,660 
sequences orientated for dominant direction of OF synthesis. b, Mutation 
rates correlate with OJ peak size. Mutations are significantly enriched 
downstream of the junction (pink), compared to genome shuffle controls (light 
green/pink). Sites grouped by OJ frequency. Points denote mean and error 
bars denote s.d. from 100 bootstrap samples or genome shuffles (controls); 
statistics by paired two-sided t-test. c, Hypothesis: DNA synthesized by 
non-proofreading Pol-α is preferentially trapped in regions rapidly bound by 
proteins post-replication. These act as partial barriers to Pol-δ displacement of 
Pol-α-synthesized DNA, resulting in locally increased mutations. 


EmRiboSeq 


To address where error-prone Pol-α DNA is retained in vivo, we used 
the incorporation of ribonucleotides into genomic DNA to track the 
activity of specific DNA polymerases. Ribonucleotides are covalently 
incorporated into genomic DNA by replicative polymerases”, although 
they are normally efficiently removed by ribonucleotide excision repair, 
a process initiated by the type 2 RNase H enzyme (RNase H2)”. In 
RNase-H2-deficient budding yeast, such ribonucleotides are generally 
well tolerated: Δrnh201 yeast has proliferation rates identical to wild 
type under normal growth conditions”, and therefore in this genetic 
background ribonucleotides can be used as a ‘label’ to track polymerase 
activity. Furthermore, the contribution of specific polymerases can 
be studied using polymerases with catalytic site point mutations (such 
as Pol-α(Leu868Met), Pol-δ(Leu612Met) and Pol-ε(Met644Gly)) that 
incorporate ribonucleotides at higher rates than their wild-type counterparts 
(refs 21, 26, 27, 30 and J. S. Williams, A. R. Clausen & T. A. 
Kunkel, personal communication; Fig. 3a). Yeast strains expressing these 
mutant polymerases have previously been used to demonstrate that 
Pol-ε and Pol-δ are the major leading- and lagging-strand polymerases, 
respectively, by measuring strand-specific alkaline sensitivity of particular 
genomic loci”. 

Figure 3 | Mapping DNA synthesis in vivo using emRiboSeq. a, Replicative 
polymerases can be tracked using point mutants with increased ribonucleotide 
incorporation. Schematic of replication fork with Pol-ε (asterisk denotes 
Met644Gly mutant) and ribonucleotide incorporation rates for each 
polymerase. Embedded ribonucleotides (R) highlighted. b, Schematic of 
emRiboSeq methodology. c, Schematic of replication. d, e, Mapping of leading/ 
lagging-strand synthesis and replication origins using emRiboSeq. Ratio of OF 
reads’’ between forward and reverse strands of chromosome 10 (Chr10; 
d) corresponds to the ratio of their respective ribonucleotide content (e) for 
Pol-δ* (orange), whereas Pol-ε* (cyan) shows negative correlation. 
Intersections with x axis correspond to replication origins and termination 
regions (c-e). Experimentally validated origins (dotted pink lines). f, Pol-α* 
DNA is detected genome-wide by emRiboSeq as a component of the lagging 
strand. Strand ratios are shown as best-fit splines, y axes denote log₂ of 
ratios (d-f). 

To track directly the genome-wide contribution of polymerases, we 
developed a next-generation sequencing approach, which we term 
emRiboSeq (for embedded ribonucleotide sequencing), that determines 
the strand-specific, genome-wide distribution of embedded ribonucleo- 
tides. This is achieved by treatment of genomic DNA with recombinant 
RNase H2 to generate nicks 5’ of embedded ribonucleotides, followed 
by ligation of a sequencing adaptor to the 3’-hydroxyl group of the 
deoxynucleotide immediately upstream of the ribonucleotide (Fig. 3b 
and Extended Data Fig. 2a). Subsequent ion-semiconductor sequencing 
permits strand-specific mapping of ribonucleotide incorporation sites. 

Control experiments using endonucleases of known sequence specificity
demonstrated 99.9% strand specificity and 99.9% site specificity
for the technique (Extended Data Fig. 2b-d). Using RNase-H2-deficient
Pol-ε(Met644Gly) and Pol-δ(Leu612Met) yeast strains, we then mapped
the relative contributions of these respective polymerases genome-wide
(Fig. 3c-e and Extended Data Figs 3 and 4). We found that ribonucleotide
incorporation in the Pol-δ(Leu612Met) strain was substantially
enriched on the DNA strand that is preferentially synthesized by
lagging-strand synthesis, in keeping with its function as the major
lagging-strand polymerase, while ribonucleotide incorporation in the
Pol-ε(Met644Gly) strain exhibited an entirely reciprocal pattern consistent
with its function as the leading-strand polymerase (Fig. 3e). Furthermore,
points at which neither enzyme showed strand preference
(intersection of both Pol-ε and Pol-δ plots with the x axis) corresponded
precisely with annotated origins of replication. Other intersection points
were also evident that correspond to replication termination regions, as
well as putative, non-annotated origins. The latter overlapped with early
replicating regions (Extended Data Fig. 3b, c). Therefore, we concluded
that emRiboSeq can be used to determine the distribution of
polymerase activity genome-wide, and has utility for the identification
of replication origin and termination sites.


Pol-α-synthesized DNA ~1.5% of genome


Having demonstrated the validity of our technique through detailed
mapping of the major replicative polymerases, we next examined the
contribution of Pol-α-synthesized DNA to the budding yeast genome.
Significantly, the Pol-α(Leu868Met) Δrnh201 strain had a strand ratio
distribution identical to that seen for Pol-δ(Leu612Met) Δrnh201,
consistent with the expected role for Pol-α in lagging-strand replication
(Fig. 3f). Furthermore, the Pol-α(Leu868Met) pattern of strand
incorporation was reciprocal to that of a wild-type polymerase strain (POL),
which displayed leading-strand bias, in keeping with a strong propensity
for ribonucleotide incorporation by leading-strand polymerase Pol-ε
compared to Pol-δ (ref. 37). Increased ribonucleotide retention on the
lagging strand was also present in DNA from stationary phase
Pol-α(Leu868Met) Δrnh201 yeast (Extended Data Fig. 3d), demonstrating
that Pol-α-derived DNA is retained in the mature genome post-replication
and that this signal was not due to the transient presence
of Pol-α DNA during S-phase.

To provide biochemical validation, we performed alkaline gel
electrophoresis on genomic DNA extracted from Pol-α(Leu868Met),
Pol-δ(Leu612Met) and Pol-ε(Met644Gly) Δrnh201 yeast. Increased
fragmentation was detected in all three strains (Extended Data Fig. 4a-c)
and increased ribonucleotide incorporation was also detected in genomic
DNA from stationary phase Pol-α(Leu868Met) yeast (Fig. 4a-c),
consistent with Pol-α tract retention in mature genomic DNA. To quantify
the contribution of Pol-α DNA to the genome, we used densitometry
measurements from the alkaline gels to calculate ribonucleotide
incorporation rates. We detected 1,500 embedded ribonucleotides
per genome in Δrnh201 genomic DNA, which increased to 2,400 sites
per genome for Pol-α(Leu868Met) (Fig. 4c). Observed ribonucleotide
incorporation rates correspond to the product of the incorporation
frequency of each polymerase and the amount of DNA it contributes to
the genome. Using the in vitro ribonucleotide incorporation rates of
wild-type and mutant polymerases and the number of ribonucleotides
embedded in vivo (Extended Data Figs 3a and 4a-c), we
estimated the relative contributions of each of the replicative
polymerases to the genome (Fig. 4d), calculating the contribution of
Pol-α to be 1.5 ± 0.3% (mean ± s.d.).

Figure 4 | Pol-α DNA synthesis contributes ~1.5% of the mature genome.
a, b, Increased ribonucleotide incorporation in Pol-α* stationary phase yeast is
detected by alkaline gel electrophoresis. kb, kilobases; WT, wild type.
c, Quantification confirms significantly increased rates in the Pol-α* genome
(n = 6 independent experiments; error bars denote s.e.m.; statistics by paired
two-sided t-test). d, Estimate of relative contribution of polymerases to the
genome (n = 4 independent experiments; error bars denote s.e.m.).

RNase H enzymes may contribute to the removal of OF RNA
primers and consequently Δrnh201 strains could have altered levels
of Pol-α-synthesized DNA to that seen in wild-type strains. This
confounding factor was excluded using an RNH201 separation-of-function
mutant, which established that retention of Pol-α DNA was independent
of a role for RNase H2 in RNA primer removal (Extended Data
Fig. 5).

In conclusion, Pol-α-synthesized DNA makes a small but significant
contribution to the genome, relative to the major replicative
polymerases, confirming the first prediction of our model.


Mutational cost of TF binding in humans 

As OF processing is a conserved process in eukaryotes, we next
considered whether an OF-related mutational signature was also present
in humans. Substitution rates are also increased at nucleosome cores
in humans with an identical distribution to yeast. Furthermore, the
TF NFYA has an unexplained ‘shoulder’ of increased substitution
proximal to its binding sites, reminiscent of the Reb1 pattern (Fig. 1b). We
therefore investigated whether similar mutational patterns are present
at other experimentally defined human TF and chromatin protein binding
sites. Increased inter-species nucleotide substitution rates were
detected flanking essential binding site residues, for many, but not all
TFs, as well as CTCF binding sites (Fig. 5a, b and Extended Data Fig. 6).
Substitution rates were measured using genomic evolutionary rate
profiling (GERP) scores, which quantify nucleotide substitution rates
relative to a genome-wide expectation of neutral evolution, such that a
negative GERP score indicates increased nucleotide substitution rates.
Furthermore, increases in mutation rate correlated with the degree of
enrichment reported in chromatin immunoprecipitation with lambda
exonuclease digestion (ChIP-exo) data sets for these proteins, likely
reflecting the strength of binding or frequency of occupancy at specific
sites, which would be expected to influence Pol-δ processivity and
consequent mutation levels.

Finally, to extend our analysis beyond common TF binding sites, we
investigated whether the same mutational signature could be found for
a broad range of regions at which regulatory proteins bind, regions we
identified by the presence of DNase I footprints. Our preceding analysis
of TFs suggested that nucleotide substitutions would be increased
immediately adjacent to the protein binding region defined by such
footprints. In yeast we found that DNase I footprint edges served as a good
proxy for increased OJ rate with significantly elevated substitution rates
(Extended Data Fig. 7). Similarly, in humans, aligning regions containing
DNase I footprints on the basis of boundary junctions (left-hand edge
of footprint) revealed substantially increased nucleotide substitution
rates close to the junction, relative to the baseline rate in the immediate
region (Fig. 5d). These increased substitution rates were related to
position rather than sequence content, as this signal was lost when a
trinucleotide preserving genome shuffle was applied, both for individual
TFs (Fig. 5b and Extended Data Fig. 6a-d) and DNase I footprints
(Fig. 5d). Therefore, this mutational signature is not due to the
retention of mutagenic sequences (for example, CpG dinucleotides) at such
sites, and is a widespread phenomenon in the genome at protein binding
sites in both yeast and humans.

Figure 5 | OF mutational signatures are conserved in humans. a, Nucleotide
substitutions (plotted as GERP scores) are increased immediately adjacent to
TF NFYA binding sites (n = 5,110). Pink to brown: lower to higher quartiles of
ChIP-exo peak height (reflecting strength of binding/occupancy). Stronger
binding correlates with substitution rate in the ‘shoulder’ region (asterisk).
b, Increased substitution rates are not a consequence of local sequence
composition effects. Strongest binding sites (brown) compared to trinucleotide
preserving shuffle (black). c, Model showing nucleotide substitution profiles
are the sum of mutation rate and selective pressure. d, Interspecies substitution
rates are also increased adjacent to DNase I footprint edges (asterisk)
(n = 33,350). Sequences aligned to left footprint edges as indicated in
schematic. Right footprint edge is indistinct owing to heterogeneity in footprint
length. Substitution rates are no longer increased after trinucleotide preserving
shuffle from local flanking sequences (black). Brown dashes and grey
shading denote 95% confidence intervals (b, d).


Discussion 


Here we establish a mutational signature at protein binding sites that
we suggest could result from the activity of the replicative polymerase
Pol-α. We use a novel technique, emRiboSeq, to demonstrate that
error-prone DNA synthesized by Pol-α is retained in the mature lagging
strand. EmRiboSeq tracks genome-wide in vivo polymerase activity
using ribonucleotides as a ‘non-invasive’ label, and will have
significant future use for the in vivo study of DNA polymerases in replication
and repair. Further optimization of emRiboSeq should permit high
resolution examination of the role of polymerases at specific sites, such
as Pol-α tract retention at protein binding sites. It will also be a useful
method for defining replication origin and termination sites, and
furthermore will facilitate the investigation of physiological roles of
genome-embedded ribonucleotides.

A direct relationship between OF junctions and mutation frequency
is indicated by the significant correlations between substitution rate and
OF junction sites at diverse protein binding sites, although future
experimental validation will be needed to establish causality formally. We
find that substitution rates are specifically increased downstream of such
junction sites, suggesting a replicative origin for such mutations. As
Pol-α DNA tracts occur genome-wide, and Pol-δ processing of OFs is
impaired by DNA-bound proteins, we propose that retention of Pol-α
DNA is increased at these functionally important sites, and is
responsible for the increased mutation rate (Fig. 2c). Replication fidelity
processes, including efficient mismatch repair at the 5’ end of OFs,
will mitigate Pol-α replication errors. Additionally, Pol-α DNA will be
incorporated at relatively low frequency (Extended Data Fig. 8), with
most DNA at such sites still synthesized by Pol-δ and Pol-ε. However,
over evolutionary timescales, it seems that these processes are
insufficient to compensate fully for the lack of Pol-α proofreading activity.
An alternative possibility is that protein binding may impair access of
replication-related repair factors, such as Exo1, to correct errors in
Pol-α-synthesized DNA. However, it does not appear that the mismatch
repair machinery is generally obstructed at such sites, as mismatch
repair efficiency at nucleosomes is reported to be uniform with respect
to dyad position.

Nucleosome formation has a key role in ensuring genome stability,
and consequently there is an imperative for the rapid repackaging of the 
genome post-replication. However, we now show that this comes at the 
cost of increased mutation at specific sites, detectable on an evolution- 
ary timescale. OF-associated mutagenesis could also have importance 
for human genetics, as it increases mutation rates at TF and regulatory 
protein binding sites. Such increased mutagenesis has been substan- 
tially obscured by strong purifying selection at these sites necessary to 
maintain functionality. Notably, increased mutation suggests that they 
will be evolutionary hotspots, and may help to explain the rapid evolutionary
turnover of TF sites and the difficulty in non-coding functional
site prediction by interspecies sequence conservation comparisons. Fur- 
thermore, as hyper-mutable loci, TF binding sites may be frequently 
mutated in inherited disease and neoplasia. 

In summary, we demonstrate that DNA synthesized by Pol-α contributes
to the eukaryotic genome, probably increasing mutations at
specific regulatory sites of relevance to both human genetics and the
shaping of the genome during evolution.

Note added in proof: Three studies, published concurrently with this
paper, have independently developed similar methods to determine the
genome-wide distribution of embedded ribonucleotides, demonstrating
the utility of ribonucleotides as markers of replication enzymology
in budding yeast.


Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper. 
METHODS


Yeast reference genome and annotation. All analyses were performed on the
sacCer3 (V64) S. cerevisiae reference genome assembly. Data sets originally
obtained with coordinates on other assemblies were projected into the sacCer3
assembly using liftOver (v261) with the corresponding chain files obtained from
http://www.yeastgenome.org. All regions of the sacCer3 genome were used for read
alignment but analyses including strand ratios and all rate estimates excluded the
following multi-copy regions: the mitochondrial genome, rDNA locus
chrXII:459153-461153 and any 100-nucleotide segment with mappability score of <0.9
(gem-mappability with k-mer = 100). In total, this masked 951,532 nucleotides (7.8%)
of the reference genome. Gene structure annotations were the Saccharomyces
Genome Database (SGD) consensus annotations extracted from the University of
California, Santa Cruz (UCSC) genome browser in November 2013. Annotated origins
of replication were obtained from ref. 53. DNase I hypersensitive sites and
footprints were obtained from ref. 54, and nucleosome position, occupancy and
positional fuzziness (positional heterogeneity) measures were from ref. 55. Yeast replication
timing data was obtained from ref. 36, where we have plotted the percentage of
heavy-light (replicated) DNA (pooled samples data set). Higher percentage
indicates earlier average replication time.

Yeast polymorphisms and between species substitution rates. Yeast polymorphism
data was obtained from the Saccharomyces Genome Resequencing project.
A polymorphic difference between any of the 37 sequenced S. cerevisiae strains was
called as a polymorphic site. Sites with n > 2 alleles were only counted once as a
polymorphic site. Only nucleotide point substitutions were considered; insertions
and deletions were excluded. The polymorphism rate reported is the number of
polymorphic sites divided by the number of sacCer3 sites with sequence coverage 
in at least two additionally sequenced strains. 
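
The rate definition above can be illustrated with a minimal sketch. It assumes a hypothetical input encoding that is not part of the original analysis: the reference and each strain are equal-length strings, with any character other than A, C, G or T marking a site without coverage. Counting the reference base as one of the alleles is also an assumption.

```python
def polymorphism_rate(ref, strains, min_covered=2):
    """Polymorphic sites divided by sites covered in >= min_covered strains.

    ref      reference sequence (string of A/C/G/T)
    strains  aligned strain sequences; non-ACGT characters mean "no coverage"
    """
    polymorphic = 0
    covered = 0
    for i, ref_base in enumerate(ref):
        calls = [s[i] for s in strains if s[i] in "ACGT"]
        if len(calls) < min_covered:
            continue  # site lacks coverage in enough additional strains
        covered += 1
        # A site with three or more alleles is still counted only once.
        if len(set(calls) | {ref_base}) > 1:
            polymorphic += 1
    return polymorphic / covered if covered else None
```

In this sketch indels are assumed to have been removed upstream, in line with the restriction to point substitutions.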

Yeast between-species substitution rates were calculated from MultiZ stacked 
pairwise alignments obtained from the UCSC genome browser (Supplementary 
Table 1). Alignments for five sensu stricto yeast species (S. cerevisiae, S. paradoxus, 
S. mikatae, S. kudriavzevii and S. bayanus) were extracted from the original seven 
species alignment. The reference assembly names and phylogenetic relationship 
are represented by the tree (((sacCer3, sacPar), sacMik), sacKud, sacBay).
Substitution rates were calculated over whole chromosomes using baseml from the paml
package (version 4.6) under the HKY85 substitution model with ncatG = 5
categorical gamma. Per-nucleotide relative rate estimates (branch length multipliers)
were obtained over the sacCer3 genome.

Human conservation measures. GERP scores were used as a measure of
between species nucleotide diversity across 46 vertebrate species. Single-nucleotide
resolution bigWig files were obtained from UCSC genome browser (hg19). For 
consistency of presentation with plots of polymorphism rate and yeast between- 
species nucleotide substitution rate, the y axes in plots showing GERP scores have 
been inverted so that greater constraint is low and greater diversity is high. 

OF sequence processing. OF sequence data was obtained from ref. 17 (GEO ac- 
cession GSM835651). Analysis primarily focused on the larger ‘replicate’ library 
but results were confirmed in the ‘sample’ library (GEO accession GSM835650). 
The OF strand ratio was calculated as the sum of per nucleotide read coverage on 
the forward strand divided by the same measure for reverse strand reads. OF strand 
ratios were calculated in windows of 2,001 nucleotides. A pseudo count of 1 read- 
covered nucleotide was added to both strands in each window to avoid divisions by 
zero. Results shown are for de-duplicated read data (identical start and end
coordinates were considered duplicates). De-duplication minimises potential biases in
PCR amplification; qualitatively similar results were obtained with non-de-duplicated
data and support identical conclusions.
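
As an illustration of the strand-ratio calculation, the following sketch computes the forward/reverse coverage ratio in consecutive windows with a pseudo count of 1 on each strand. The function name and the list-based input format are assumptions made for the example, not part of the original pipeline.

```python
import math


def strand_ratios(fwd_cov, rev_cov, window=2001, pseudo=1):
    """Forward/reverse coverage ratio in consecutive, non-overlapping windows.

    fwd_cov, rev_cov  per-nucleotide read coverage on each strand
    pseudo            added to both strands in each window to avoid division by zero
    """
    ratios = []
    for start in range(0, len(fwd_cov), window):
        fwd = sum(fwd_cov[start:start + window]) + pseudo
        rev = sum(rev_cov[start:start + window]) + pseudo
        ratios.append(fwd / rev)
    return ratios


def log2_ratios(ratios):
    """Log2 transform, matching the y axes used for the strand-ratio plots."""
    return [math.log2(r) for r in ratios]
```

For example, `strand_ratios([1, 1, 2, 0], [0, 0, 1, 1], window=2)` gives one ratio per two-nucleotide window.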

Rather than using separate Okazaki 5’ and 3’ end counts, which did not always
correlate well (probably owing to amplification, sequencing and size selection
biases), we produced a normalized OJ rate measure. This is the average of (1) the
fraction of upstream OFs that terminate with a 3’ end at a focal nucleotide, and 
(2) the fraction of downstream OFs whose 5’ end is at the focal nucleotide. The 
upstream and downstream coverage measures were based on mean Okazaki read 
coverage for the nucleotides located between 5 and 12 nucleotides upstream (down- 
stream) of the focal 3’ (5’) end. This OJ rate was calculated at single nucleotide 
resolution over both strands of the sacCer3 genome. 
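
A minimal sketch of this normalized OJ rate for a single focal nucleotide is given below. It assumes that all three arrays are oriented 5’ to 3’ along one strand, that "between 5 and 12 nucleotides" is inclusive at both ends, and that positions too close to the array ends or with zero flanking coverage are returned as missing; these conventions are illustrative choices.

```python
def oj_rate(ends3, ends5, coverage, i, near=5, far=12):
    """Average of the 3'-end fraction (upstream OFs) and 5'-end fraction (downstream OFs).

    ends3[i], ends5[i]  counts of OF 3' and 5' ends at position i
    coverage[i]         OF read coverage at position i
    """
    if i - far < 0 or i + far >= len(coverage):
        return None  # flanking windows would run off the sequence
    upstream = coverage[i - far:i - near + 1]      # positions i-12 .. i-5
    downstream = coverage[i + near:i + far + 1]    # positions i+5 .. i+12
    up_mean = sum(upstream) / len(upstream)
    down_mean = sum(downstream) / len(downstream)
    if up_mean == 0 or down_mean == 0:
        return None  # no OF coverage to normalize against
    return 0.5 * (ends3[i] / up_mean + ends5[i] / down_mean)
```

Normalizing each end count by the coverage of the fragments that could have produced it is what makes the two end types comparable.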

EmRiboSeq alignment and processing. Sequence reads (see Supplementary 
Table 2 for runs and read numbers) were aligned to the unmasked sacCer3 ge- 
nome with bowtie2 (version 2.0.0). Subsequent filtering and format conversion 
were performed using Samtools (version 0.1.18) and BEDTools (version 2.16.2). 
Only reads with a mapping quality score >30 were kept for analysis. As there had 
been no pre-sequencing amplification, de-duplication was not performed. Read 
5'-end counts were summed per strand at single nucleotide resolution over the 
yeast reference genome. Note that under the emRiboSeq protocol, the ribonucle- 
otide incorporation site would be one nucleotide upstream and on the opposite 
strand to the mapped read 5’ end. To facilitate comparison between libraries of 
differing read depth, read counts were normalized to sequence tags per million
mapped into the non-masked portion of the genome. 
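
The coordinate shift and normalization described above can be sketched as follows. The sketch assumes that "upstream" is defined relative to the orientation of the mapped read, so that a forward-strand read points to a reverse-strand site at the next lower coordinate and vice versa; function names and the dictionary input are hypothetical.

```python
def ribonucleotide_site(read_5p_pos, read_strand):
    """Infer the embedded ribonucleotide position from a mapped read 5' end.

    The site is one nucleotide upstream of the read 5' end (relative to the
    read) and on the opposite strand.
    """
    if read_strand == "+":
        return read_5p_pos - 1, "-"
    if read_strand == "-":
        return read_5p_pos + 1, "+"
    raise ValueError("read_strand must be '+' or '-'")


def tags_per_million(counts, total_unmasked_reads):
    """Scale per-site read counts to sequence tags per million mapped reads.

    total_unmasked_reads is the number of reads mapped into the non-masked
    portion of the genome for that library.
    """
    scale = 1e6 / total_unmasked_reads
    return {site: n * scale for site, n in counts.items()}
```

Applying the same scaling to every library allows site counts to be compared between runs of differing depth.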
Defining TF binding sites. Reb1 and Rap1 ChIP-exo data was obtained from ref. 58
(Sequence Read Archive accession SRA044886). Sequence bar codes were clipped
and sequences sorted using Perl (version 5.18.2). Reads were aligned using bowtie2
(version 2.0.0). Following the previously published protocol, up to three
mismatches across the length of each tag sequence were allowed, and the 3’-most 6
base pairs (bp) removed. Peaks were called with MACS (version 2.0.10). Following
ref. 58, sites were defined as monomer if no other peaks were present within
100 bp. Where two or more peaks were present within 100 bp the peak with the
highest occupancy was labelled as the primary peak. Telomeric sites were excluded
using annotations within the sacCer3 sgdOther UCSC table
(http://www.yeastgenome.org). The presence or absence of a motif was determined
using the Motif Occurrence Detection Suite (MOODS) (version 1.0.1). Position
weight matrices for consensus binding motifs were obtained from JASPAR
(http://jaspar.genereg.net/). The matching motif significance threshold was set at
0.005. Multiple peaks were aligned (x = 0) to the midpoint of the JASPAR defined
motif. Human TF
binding sites were defined using ChIP-seq data (Supplementary Table 1) as for yeast, 
except that the peak clustering threshold was reduced to 50 nucleotides. 
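
The monomer/primary-peak rule can be sketched as below. The sketch assumes single-linkage grouping (a peak joins a cluster when it lies within the threshold of the previous peak), treats "within 100 bp" as a distance of at most 100, and introduces a hypothetical "secondary" label for the remaining peaks in a cluster; none of these details is specified in the text.

```python
def classify_peaks(peaks, max_gap=100):
    """Label peaks on one chromosome as 'monomer', 'primary' or 'secondary'.

    peaks    iterable of (position, occupancy) tuples
    max_gap  clustering threshold (100 for yeast, 50 for the human sites)
    """
    clusters = []
    for peak in sorted(peaks):
        if clusters and peak[0] - clusters[-1][-1][0] <= max_gap:
            clusters[-1].append(peak)   # close to the previous peak: same cluster
        else:
            clusters.append([peak])     # start a new cluster
    labelled = []
    for cluster in clusters:
        if len(cluster) == 1:
            labelled.append((*cluster[0], "monomer"))
            continue
        best = max(cluster, key=lambda p: p[1])  # highest occupancy in the cluster
        for p in cluster:
            labelled.append((*p, "primary" if p is best else "secondary"))
    return labelled
```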
Computational and statistical analyses. Analysis and all statistical calculations 
were performed in R (version 3.0.0). Lines of fit used the smooth.spline function 
with degrees of freedom: Fig. 1a, 18 degrees; Fig. 1b, 34 degrees; Fig. 3d-f, 80 
degrees of freedom (strand ratio calculated in 2,001-nucleotide consecutive
windows). Sliding window averages used the rollapply function from the zoo package
with centre alignment and null padding. Pearson’s correlation was performed with 
the cor.test function in R, paired Student’s t-test with the t.test function, Mann- 
Whitney tests with the wilcox.test function and lowess (locally weighted scatterplot 
smoothing) with the lowess function and default parameters. 
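
For readers working outside R, a centre-aligned sliding average with null padding can be approximated as in the sketch below. It is a plain-Python analogue written for illustration, not the zoo implementation, and it assumes an odd window width.

```python
def centred_rolling_mean(values, width):
    """Centre-aligned moving average; positions without a full window are None."""
    if width % 2 == 0:
        raise ValueError("width must be odd for centre alignment")
    half = width // 2
    out = []
    for i in range(len(values)):
        if i - half < 0 or i + half >= len(values):
            out.append(None)  # null padding at both ends
        else:
            window = values[i - half:i + half + 1]
            out.append(sum(window) / width)
    return out
```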

No statistical methods were used to predetermine sample size. 
Rate estimates with compositional correction. Polymorphism and OJ rates were 
calculated separately for each nucleotide (A, T, C or G) and the average of these four
rates used as the reported or plotted measure for a nucleotide site or group of sites.
This corrects for mononucleotide compositional biases that are abundant when 
sampling specific features of a genome. The between-species relative substitution 
rate calculation incorporates a compositional correction. The rate estimates shown 
are the number of observations divided by the number of sites with non-missing 
data. 
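
A minimal sketch of this compositional correction follows. The input format (a list of base/observation pairs, with None for missing data) is hypothetical, and averaging only over bases that have at least one non-missing site is an assumption for the case where a base is absent from the sample.

```python
def composition_corrected_rate(sites):
    """Average of per-nucleotide rates, so base composition does not bias the result.

    sites  list of (base, observed) with observed = 1, 0 or None (missing data)
    """
    per_base = {}
    for base in "ACGT":
        obs = [o for b, o in sites if b == base and o is not None]
        if obs:
            # observations divided by sites with non-missing data
            per_base[base] = sum(obs) / len(obs)
    if not per_base:
        return None
    return sum(per_base.values()) / len(per_base)
```

With this weighting, a feature that happens to be A-rich contributes the same to the reported rate as one that is G-rich.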
Trinucleotide preserving shuffles. Every nucleotide of the sacCer3 genome was 
assigned to one of 64 categories based on the identity of that nucleotide and its 
flanking nucleotides. A vector of transformations was produced by swapping the 
genomic coordinate of a nucleotide for one with an identical category chosen at 
random. Swaps between masked and unmasked sites (see above) were prevented. 
100 such vectors were produced. For a set of stacked coordinates (for example, 
Fig. 1a comprising 27,586 sequences, each of 251 nucleotides), every nucleotide of 
every sequence was substituted through the transformation vector, for a randomly 
selected proxy, matched for the same trinucleotide context and their correspond- 
ing rate or annotation used. This provides a compositionally well-matched null 
expectation. With 100 independent transformation vectors we provide empirically 
derived 95% confidence bounds and standard deviations on those null expecta- 
tions. For human sites, shuffles were confined to sequences flanking the region of 
interest (100-300 nucleotides distant from the binding site for TF analysis and 
1,000-2,000 nucleotides distant for DNase I footprint analysis). Human genomic 
coordinates in the ENCODE ‘Duke Excluded Regions’ and those positions with a 
uniqueness score of <0.9 (gem-mappability with k-mer = 100) were excluded
from shuffles. 
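
The construction of one transformation vector can be sketched as follows. This version permutes coordinates within each trinucleotide category, which is one possible reading of "chosen at random" (sampling with replacement would be another); the optional `masked` list and the choice to leave the two terminal nucleotides in place are assumptions of the example.

```python
import random
from collections import defaultdict


def trinucleotide_shuffle_vector(seq, masked=None, rng=None):
    """Map each coordinate to a random coordinate with the same trinucleotide context.

    seq     genome sequence (string)
    masked  optional list of booleans; masked and unmasked sites never swap
    """
    rng = rng or random.Random()
    groups = defaultdict(list)
    for i in range(1, len(seq) - 1):
        is_masked = bool(masked[i]) if masked is not None else False
        groups[(seq[i - 1:i + 2], is_masked)].append(i)
    vector = list(range(len(seq)))  # terminal positions map to themselves
    for coords in groups.values():
        shuffled = coords[:]
        rng.shuffle(shuffled)
        for src, dst in zip(coords, shuffled):
            vector[src] = dst
    return vector


def shuffled_values(values, vector, coords):
    """Look up the rate or annotation of the proxy site for each coordinate."""
    return [values[vector[c]] for c in coords]
```

Repeating the construction with independent random seeds (100 times in the text) yields the empirical null distribution.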
Sites selected for analysis. Thresholds were applied to define specific subsets of 
sites to be evaluated. For the presented data (Fig. 1a) nucleosomes with an
occupancy of >80%, positional fuzziness of <30, with at least 30 OF reads over them,
and located more than 200 nucleotides from transcription start sites were used.
Other combinations (Extended Data Fig. 1) of these parameters gave qualitatively
similar results and support the same conclusions. Reb1 (and Rap1) sites were defined
as the primary ChIP-exo peak at a site, with sequences aligned (x = 0) to the centre of
the highest scoring Reb1/Rap1 position weight matrix match within 50 nucleotides
of the ChIP-exo peak summit. DNase I footprints from 41 human cell types were
previously combined into consensus footprints (combined.fps.gz). We intersected
the combined footprints with those found in each cell type using BEDtools (version 
2.17.0) to identify the subset (n = 33,530) that were detected in all 41 cell types. The 
left-edge coordinate as defined in the combined footprint file was used as the focal 
nucleotide (x = 0) for analysis. 
Comparison of polymorphism rates. The five nucleotide positions downstream 
and the five upstream of the focal OJ position (excluding x = 0 in both cases) were 
scored for their polymorphism rate (Fig. 2b). Rate deltas were calculated as up- 
stream minus downstream in 100 bootstrap replicates and a paired two-sided t-test 
performed against the same calculation performed on 100 trinucleotide preserving
genome shuffles of the same sites. This tests whether the difference in rate between 
upstream and downstream positions is greater in the observed data than the shuf- 
fled data. 
DNA purification. Yeast strains were grown at 30 °C in YPDA to mid-log phase
(see Supplementary Table 3 for a list of strains) or to saturation for stationary
phase. Per 5 A600 nm units, cell pellets were resuspended in 200 µl lysis buffer (2%
Triton X-100, 1% SDS, 0.5 M NaCl, 10 mM Tris-HCl pH 8.0, 1 mM EDTA). An
equal volume of TE-equilibrated phenol and glass beads (0.40-0.60 mm diameter,
Sartorius) were added, and cells lysed by vortexing for 2 min; 200 µl TE buffer was
then added, followed by an additional 1 min of vortexing. After centrifugation, the
aqueous phase was further extracted with equal volumes of
phenol:chloroform:isoamylalcohol (25:24:1) and chloroform. Total nucleic acids were
precipitated with 1 ml of 100% ethanol, and dissolved in 0.5 M NaCl. RNA was degraded by
treatment with 10 µg RNase A (Roche) for 1 h at room temperature. DNA was
finally purified with an equal volume of Ampure XP beads (Beckman Coulter) and
eluted in nuclease-free water. For library preparations DNA was isolated from up
to 40 A600 nm units.
Alkaline gel electrophoresis. Isolated genomic DNA (0.5 µg) was treated with
recombinant RNase H2 (purified as previously described) and ethanol precipitated.
DNA pellets were dissolved in alkaline loading dye and separated on 0.7% agarose
gels (50 mM NaOH, 1 mM EDTA) as previously described, and stained with
SYBR Gold (Life Technologies). Densitometry measurements and derivation of
ribonucleotide incorporation rates were as previously described. Percentage genome
contribution for each replicative polymerase (x) was calculated using the following
formula: $N_{\Delta\mathrm{Pol}x} \cdot F_{\mathrm{Pol}x} / (N_{\Delta\mathrm{Pol}\alpha} \cdot F_{\mathrm{Pol}\alpha} + N_{\Delta\mathrm{Pol}\delta} \cdot F_{\mathrm{Pol}\delta} + N_{\Delta\mathrm{Pol}\varepsilon} \cdot F_{\mathrm{Pol}\varepsilon})$,
with $N_{\Delta\mathrm{Pol}x}$ the number of ribonucleotides incorporated in one yeast
genome for the mutant polymerase, above that detected in the Δrnh201 POL strain,
measured on the same alkaline gel, and $F_{\mathrm{Pol}x}$ the frequency of
incorporation by that polymerase (see Fig. 3a).
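
The contribution formula can be evaluated with a short sketch such as the one below. The dictionary keys and the idea of passing N and F as two dictionaries are illustrative choices; F is used exactly as it enters the formula, and any numbers supplied by a reader would have to come from the gel measurements and Fig. 3a rather than from this example.

```python
def genome_contributions(excess_ribos, frequency):
    """Percentage genome contribution of each polymerase: N*F / sum(N*F) * 100.

    excess_ribos  N for each polymerase: ribonucleotides per genome in the mutant
                  strain above the level of the POL control on the same gel
    frequency     F for each polymerase, as defined for the formula
    """
    products = {pol: excess_ribos[pol] * frequency[pol] for pol in excess_ribos}
    total = sum(products.values())
    if total == 0:
        raise ValueError("no excess ribonucleotides detected in any strain")
    return {pol: 100.0 * value / total for pol, value in products.items()}
```

By construction the three percentages sum to 100, so the estimate for each polymerase depends on the measurements for all three mutant strains.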
EmRiboSeq library preparation and sequencing. DNA was sonicated using a
Bioruptor Plus (Diagenode) to achieve an average fragment length of
approximately 400 bp. Fragmented DNA was concentrated by ethanol precipitation and
size selected using 1.2 volumes of Ampure XP. DNA was quantified by nanodrop
(Thermo Scientific) and up to 5 µg was used for NEBNext End Repair and
dA-Tailing (New England Biolabs) following the manufacturer’s guidelines. After the
end-repair reaction, DNA was purified using 1.2 volumes of Ampure XP.
Subsequent steps were performed in the presence of Ampure XP beads, capturing the
DNA by adding NaCl and PEG8000 to final concentrations of 1.25 M and 10%,
respectively. The trP1 adaptor (see below) was attached using NEBNext Quick
Ligation with 120 pmol of adaptor per microgram of DNA for 14-18 h at 16 °C.
Terminal transferase (NEB) was then used to block any free 3’ ends with ddATP
for 2 h at 37 °C, with 20 U of TdT per microgram of DNA. After Ampure XP
purification, beads were removed and DNA nicked using recombinant RNase H2
(10 pmol µg⁻¹ of library) or Nb.BtsI (NEB; 10 U µg⁻¹) for 2 h at 37 °C. RNase H2
purification and reaction conditions were as previously described. Enzymes were
inactivated by heating at 80 °C for 20 min, and DNA was purified using 1.8 volumes
of Ampure XP. Shrimp alkaline phosphatase (Affymetrix; 5 U) was then used to
remove 5’ phosphates at 37 °C (1 h per µg of library). After heat inactivation for 15 min
at 65 °C and Ampure XP purification, DNA was denatured by heating at 95 °C for
5 min and snap cooling. Subsequently, A adaptor (see below; 120 pmol µg⁻¹ of
library) was attached using NEBNext Quick Ligation for 14-18 h at 16 °C. Fragments
with biotinylated A adaptor were captured on streptavidin-coupled M-280 Dynabeads
(Life Technologies) following the manufacturer’s guidelines, and non-biotinylated
strands were released in 0.15 M NaOH. Single-stranded fragments were
concentrated by ethanol precipitation.

Phusion Flash High-Fidelity PCR Master Mix (Thermo Scientific) was then used 
for second strand synthesis with primer A to produce a double stranded library. 
Size selection of fragments between 200 and 300 bp in size was performed using 2% 
E-Gel EX (Life Technologies). Finally, this library was quality checked and quan- 
tified using a 2100 Bioanalyzer (Agilent Technologies) before emulsion PCR, using 
the Ion Torrent One Touch, and next generation sequencing on the Ion Torrent 
PGM or Proton platform (Life Technologies). 

Oligonucleotides and adaptor design. Custom oligonucleotides were synthesized
by Eurogentec. Adaptor primer pairs were annealed by heating at 95 °C
for 5 min and cooling gradually. Sequences of the adaptor primer pairs were as
follows. Adaptor 1 (trP1): trP1-top,
5’-CCTCTCTATGGGCAGTCGGTGAT-phosphorothioate-T-3’; trP1-bottom,
5’-phosphate-ATCACCGACTGCCCATAGAGAGGC-dideoxy-3’. Adaptor 2 (A): A-top,
5’-phosphate-CTGAGTCGGAGACACGCAGGGATGAGATGG-dideoxy-3’; A-bottom,
5’-biotin-CCATCTCATCCCTGCGTGTCTCCGACTCAGNNNNNN-C3 phosphoramidite-3’. The
sequence for primer A used in second strand synthesis was
5’-CCATCTCATCCCTGCGTGTCTCCGAC-3’.
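
As a quick consistency check on the oligonucleotide sequences listed above, the sketch below confirms that each adaptor pair is complementary and that primer A matches the start of A-bottom. The variable names are hypothetical, chemical modifications are ignored, and only the base sequences given in the text are used (the terminal phosphorothioate-T of trP1-top and the NNNNNN tail of A-bottom are left out of the comparison).

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def revcomp(seq):
    """Reverse complement of an unambiguous DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


TRP1_TOP_CORE = "CCTCTCTATGGGCAGTCGGTGAT"        # before the phosphorothioate-T
TRP1_BOTTOM = "ATCACCGACTGCCCATAGAGAGGC"
A_TOP = "CTGAGTCGGAGACACGCAGGGATGAGATGG"
A_BOTTOM_CORE = "CCATCTCATCCCTGCGTGTCTCCGACTCAG"  # before the NNNNNN tail
PRIMER_A = "CCATCTCATCCCTGCGTGTCTCCGAC"


def check_adaptors():
    """Return which of the expected pairing relationships hold."""
    return {
        # trP1-bottom carries one extra 3' nucleotide beyond the duplex
        "trP1_pair": revcomp(TRP1_BOTTOM[:-1]) == TRP1_TOP_CORE,
        "A_pair": revcomp(A_TOP) == A_BOTTOM_CORE,
        # primer A can therefore anneal to A-top for second strand synthesis
        "primer_A_prefix": A_BOTTOM_CORE.startswith(PRIMER_A),
    }
```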

Data sources, sequencing data and S. cerevisiae strains. Documented in Sup- 
plementary Tables 1-3. 
Extended Data Figure 1 | Increased OJ and polymorphism rates correlate at
binding sites of different nucleosome classes and at Rap1 binding sites.
a-f, OJ and polymorphism rates are strongly correlated for different classes of
nucleosomes. Data presented as in Fig. 1a, for different sub-classes of
S. cerevisiae nucleosomes, demonstrating that OJ and polymorphism rates
co-vary in all cases. Transcription start site proximal nucleosomes (d) are probably
subject to strong and asymmetrically distributed selective constraints, which
is likely to explain the modestly reduced correlation for this subset. Such
transcription start site proximal nucleosomes were excluded from analyses of
other categories presented (b, c, e, f), except ‘all nucleosomes’ (a). g, OJ and
polymorphism rates are correlated for the S. cerevisiae TF, Rap1. Data
presented, as for Reb1 in Fig. 1b, show increased OJ and polymorphism rates
around its binding site, with a dip corresponding to its central recognition
sequence. h-j, Increased polymorphism and OJ rates at Rap1 (h), nucleosome
(i) and Reb1 (j) binding sites are not caused by biases in nucleotide content.
Distributions calculated as for g, Fig. 1a and b, respectively, using a
trinucleotide preserving genome shuffle. Pink shaded areas denote 95%
confidence intervals for nucleotide substitution rates (100 shuffles).
k, l, Polymorphism (red) and between-species (black) substitution rates are
highly correlated for nucleosome (k) and Reb1 (l) binding sites. Best fit splines
shown only. y axes scaled to demonstrate similar shape distribution.
Values plotted as percentage relative to the mean rate for all data points (central
11 nucleotides excluded for calculation of mean in g, l).

Extended Data Figure 2 | EmRiboSeq methodology and validation.
a, Schematic of emRiboSeq library preparation. rN, ribonucleotide.
b-d, Validation of strand-specific detection of enzymatically generated nicks
through linker-ligation. Nb.BtsI nicking endonuclease cleaves the bottom
strand of its recognition site releasing a 5' fragment (cyan) with a free 3'-OH
group after denaturation, to which the sequencing adaptor (pink) is ligated,
allowing sequencing and mapping of this site to the genome (b). Nb.BtsI
libraries have high reproducibility between Δrnh201 POL and Δrnh201 Pol-α*
(pol1-L868M) strains after normalizing read counts to sequence tags per million
(TPM). Bona fide Nb.BtsI sites were equally represented, at maximal frequency,
in both libraries (c). Those with lower frequencies represented sites in close
proximity to other Nb.BtsI sites, causing their partial loss during size selection.
Additionally, Nb.BtsI-like sites were detected as the result of star activity.
Libraries were also prepared using BciVI restriction enzyme digestion, which did
not show such star activity (data not shown), allowing calculation of the site
specificity for the method (>99.9%). Summed signal at Nb.BtsI sites shows
>99.9% strand specificity (blue, correct strand; grey, opposite strand) and
>99% single nucleotide resolution (d).
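
The normalization to sequence tags per million (TPM) mentioned in this legend can be illustrated with a short sketch. It assumes that the library total is simply the sum of the per-site counts supplied; the function name and the example counts are hypothetical and do not come from the study.

```python
def tags_per_million(site_counts: dict) -> dict:
    """Scale raw per-site read counts so that each library sums to one million."""
    total = sum(site_counts.values())
    if total == 0:
        raise ValueError("library contains no mapped reads")
    return {site: count * 1_000_000 / total for site, count in site_counts.items()}


if __name__ == "__main__":
    # Two hypothetical libraries sequenced to different depths.
    shallow = {"site_1": 10, "site_2": 30}
    deep = {"site_1": 1_000, "site_2": 3_000}
    # After normalization the two libraries are directly comparable.
    print(tags_per_million(shallow))
    print(tags_per_million(deep))
```

Scaling of this kind removes sequencing depth as a variable, which is what allows site frequencies in two independent libraries to be compared.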

Extended Data Figure 3 | Mapping replicative polymerase DNA synthesis
using emRiboSeq. a, Point mutations in replicative polymerases elevate
ribonucleotide incorporation rates, permitting their contribution to genome
synthesis to be tracked. Schematic of replication fork with polymerases and
their ribonucleotide incorporation rates (refs 27, 30 and J. S. Williams,
A. R. Clausen & T. A. Kunkel, personal communication) as indicated (POL
denotes wild-type polymerases; asterisk denotes point mutants). Embedded
ribonucleotides indicated by ‘R’; additional incorporation events due to
polymerase mutations highlighted by shaded circles. b, c, Mapping of
leading/lagging-strand synthesis by Pol-δ* and Pol-ε* yeast strains using
emRiboSeq (as in Fig. 3) highlights both experimentally validated (pink dotted
lines) and putative (grey dotted lines) replication origins. These often
correspond to regions of early replicating DNA (c). d, Pol-α* DNA is detected
genome-wide by emRiboSeq as a component of the lagging strand in stationary
phase yeast, as shown by the opposite pattern for a polymerase wild-type
strain. Strand ratios are shown as best-fit splines with 80 degrees of freedom;
y axes show log2 of the strand ratio calculated in 2,001-nucleotide windows
(b-d).
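
The windowed strand ratio plotted in these panels can be sketched as follows. This is an illustration under stated assumptions rather than the authors' pipeline: windows are taken as consecutive and non-overlapping, a pseudocount (not mentioned in the legend) is added to avoid division by zero, and the spline smoothing step is omitted. The function name and parameters are hypothetical.

```python
import math

WINDOW = 2001  # window size in nucleotides, as given in the legend


def log2_strand_ratios(forward, reverse, window=WINDOW, pseudocount=1.0):
    """Return log2(forward/reverse) read counts for consecutive windows.

    `forward` and `reverse` are per-position read counts along one chromosome.
    A trailing partial window is dropped. The pseudocount is an assumption.
    """
    if len(forward) != len(reverse):
        raise ValueError("strand count vectors must have equal length")
    ratios = []
    for start in range(0, len(forward) - window + 1, window):
        fwd = sum(forward[start:start + window]) + pseudocount
        rev = sum(reverse[start:start + window]) + pseudocount
        ratios.append(math.log2(fwd / rev))
    return ratios


if __name__ == "__main__":
    # Toy example with a window of 2 nucleotides for readability.
    print(log2_strand_ratios([3, 0, 0, 1], [1, 0, 1, 0], window=2))
```

Positive values indicate a forward-strand excess and negative values a reverse-strand excess, so a sign change along a chromosome marks a switch in the predominant strand.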

Extended Data Figure 4 | Quantification of in vivo ribonucleotide
incorporation by replicative polymerases. a, b, Representative alkaline gel
electrophoresis of genomic DNA from yeast strains with mutant replicative
DNA polymerases (a), with accompanying densitometry plots (b). Embedded
ribonucleotides are detected by increased fragmentation of genomic DNA
following alkaline treatment in an RNase H2-deficient (Δrnh201) background.
Increased rates are seen with all three mutant polymerases (indicated by
asterisk, as defined in Extended Data Fig. 3a), and are reduced in Pol-ε’, which
contains the point mutation Met644Leu, a mutation that increases selectivity
for dNTPs over rNTPs. c, Quantification of average ribonucleotide
incorporation in polymerase mutants from four independent experiments.
DNA isolated from mid-log phase cultures; error bars denote s.e.m. Overall
ribonucleotide content is the product of incorporation frequency and the total
contribution of each polymerase, resulting in the total ribonucleotide content
detected being highest for Pol-ε* (14,200 per genome), followed by Pol-δ*
(4,300 per genome), Pol-α* (2,700 per genome), POL (1,900 per genome) and
Pol-ε’ (860 per genome). d, Most of the yeast genome exhibits directional
asymmetry in replication (median 4:1 strand ratio). Count of genomic
segments calculated for consecutive 2,001-nucleotide windows over the yeast
genome based on reanalysis of OF sequencing data, denoted as ‘Okazaki-seq’.
The strand asymmetry ratio was calculated after re-orienting all regions such
that the predominant lagging strand was the forward strand. e-g, Genome-
wide quantification of strand-specific incorporation of wild-type and mutant
replicative DNA polymerases determined by emRiboSeq reflects their roles in
leading- and lagging-strand replication. A close to linear correlation with
Okazaki-seq strand ratios is observed. The strand ratio preference for lagging-
strand ribonucleotide incorporation for independent libraries (including
stationary phase libraries for POL and Pol-α*, marked by diamonds) was
plotted against the lagging:leading-strand ratio determined using Okazaki-seq
data (only ratios ≥ 1:1 for the latter are shown for clarity). There was high
reproducibility between experiments in strand ratio preferences. Lines are
lowess smoothed (see Methods) representations of the full data sets
(representative examples given in f and g). f, g, Scatter plots illustrating the
individual strand ratio data points for 2,001-nucleotide windows, for stationary
phase POL (f) and Pol-α* (g) yeast. Pearson’s correlation = 0.49,
P < 2.2 × 10^-16 for POL (f); correlation = 0.75, P < 2.2 × 10^-16 for Pol-α* (g).

Extended Data Figure 5 | Pol-α-synthesized DNA retention is independent
of RNase H2 processing of RNA primers. a, b, The ribonucleotide content of
genomic DNA is unchanged between Δrnh201 strains transformed with empty
vector (−) or vector expressing Rnh201 separation-of-function mutant (sf),
which retains the ability to cleave RNA:DNA hybrids, including RNA primers,
but cannot cleave single embedded ribonucleotides. In contrast, the same
vector expressing wild-type Rnh201 (wt) fully rescues alkaline sensitivity of the
DNA. As complementation with the separation-of-function mutant had no
detectable effect on the ribonucleotide content seen in the Pol-α(Leu868Met)
Δrnh201 strain, retention of Pol-α-synthesized DNA appears to be
independent of a putative role for RNase H2 in RNA primer removal.
Representative result shown for n = 3 independent experiments. c, Wild-type
and mutant Rnh201 are expressed at equal levels, as shown by immuno-
detection of the C-terminal FLAG tag. Loading control, actin.

Extended Data Figure 6 | Elevated substitution rates are observed adjacent
to many human TF binding sites. a-d, Nucleotide substitution rates (plotted
as GERP scores) are elevated immediately adjacent to REST (a, b) and CTCF
binding sites (c, d). Colour intensity shows quartiles of ChIP-seq peak height
(pink to brown: lower to higher), reflecting strength of binding/occupancy.
Stronger binding correlates with greater increases of proximal substitution rate
in the ‘shoulder’ region (asterisk). Increased substitution rates are not a
consequence of local sequence composition effects (b, d). Strongest binding
quartile of sites (brown) is shown compared to a trinucleotide preserving
shuffle (black) based on the flanking sequence (100-300 nucleotides from motif
midpoint) of the same genomic locations. Brown dashed line and grey shading
denote 95% confidence intervals. e, Substitution rates plotted as GERP
scores for human TF binding sites identified in ChIP-seq data sets (in
conjunction with binding site motif). Sites aligned (x = 0) on the midpoint of
the TF binding site within the ChIP-seq peak (colours as for a-d). Dashed
black line shows y = 0, the genome wide expectation for neutral evolution.

Extended Data Figure 7 | OJ and polymorphism rates are increased at yeast
DNase I footprints. a, b, DNase I footprint edges correspond, genome-wide, to 
increased OJ rates and locally elevated polymorphism rates in S. cerevisiae (a), a 
pattern that is maintained when footprints associated with Reb1 and Rap1 
binding sites are excluded (b). Genome-wide DNase I footprints (n = 6,063) 
and excluding those within 50 nucleotides of a Reb1 or Rap1 binding site

(n = 5,136) were aligned to their midpoint. c, d, Aligning DNase I footprints on 
their left edge rather than midpoint (to compensate for substantial 
heterogeneity in footprint size) demonstrates a distinct shoulder of elevated 
polymorphism rate at the aligned edge (c), with a significant elevation 
compared to nearby sequence upstream from the footprint (d). DNase I 
footprints from a were aligned to their left edge (x = 0) with corresponding 
polymorphism rates shown (c). The increased polymorphism rate cannot be 
explained by local sequence compositional distortions (d). Nucleotide 
substitution rates in the 11 nucleotides centred on the DNase footprint edge 
(pink line), and another 11 nucleotides encompassing positions —35 to —25 
relative to the footprint edge (green line) were quantified. Darker pink and 
green filled circles denote the mean of observed substitution rates and lighter 
shades denote the mean for the same sites after trinucleotide preserving
genomic shuffles. Error bars denote s.d.; statistics by Mann-Whitney test. 

e, Model shows that correlation of increased nucleotide substitution and OJ
rates is consistent with increased mutation frequency across heterogeneous
DNase I footprints. Polymorphism is reduced at sequence-specific binding 
sites within the footprints, owing to functional constraint. Therefore, the effect 
of OF-related mutagenesis in these regions is most sensitively detected in 

the region immediately adjacent to the binding site (left of vertical dashed blue 
line, representing footprints aligned to their left edge). This ‘shoulder’ of
increased nucleotide substitutions, which represents sites with increased OJ-
associated mutation, is followed by a region of depressed substitution rates,
owing to selective effects of the functional binding sites within the footprints (to 
the right of the dashed blue line). Signals further to the right are not 
interpretable given the heterogeneity in DNase I footprint sizes. Given strong 
selection at TF and DNase I footprint sites, this ‘shoulder’ of elevated nucleotide 
substitutions could represent a measure for the local mutation rate for 

such regions, analogous to that measured by the fourfold degenerate sites in 
protein coding sequence. 

Extended Data Figure 8 | Model to show Pol-α DNA tract retention
downstream of protein binding sites. a, OF priming occurs stochastically,
with the 5' end of each OF initially synthesized by Pol-α and the remainder of
the OF synthesized by Pol-δ. b, c, OF processing: when Pol-δ encounters the
previously synthesized OF, Pol-δ continues to synthesize DNA displacing
the 5' end of the downstream OF, which is removed by nucleases to result
in mature OFs which are then ligated. The OJs of such mature OFs before
ligation were detected previously after depletion of temperature-sensitive
DNA ligase I. That study demonstrated that if a protein barrier is encountered
(grey circle), Pol-δ progression is impaired, leading to reduced removal of the
downstream OF (b). Given that ~1.5% of the mature genome is synthesized by
Pol-α, a proportion of lagging strands will retain Pol-α-synthesized DNA (red).
When Pol-δ progression is impaired by protein binding, this will lead to an
increased fraction of fragments containing Pol-α-synthesized DNA
downstream of such sites (c).

Crystal structure of the V(D)J
recombinase RAG1-RAG2


Min-Sung Kim1, Mikalai Lapkouski1†, Wei Yang1 & Martin Gellert1


V(D)J recombination in the vertebrate immune system generates a highly diverse population of immunoglobulins and
T-cell receptors by combinatorial joining of segments of coding DNA. The RAG1-RAG2 protein complex initiates this
site-specific recombination by cutting DNA at specific sites flanking the coding segments. Here we report the crystal
structure of the mouse RAG1-RAG2 complex at 3.2 Å resolution. The 230-kilodalton RAG1-RAG2 heterotetramer is
‘Y-shaped’, with the amino-terminal domains of the two RAG1 chains forming an intertwined stalk. Each RAG1-RAG2
heterodimer composes one arm of the ‘Y’, with the active site in the middle and RAG2 at its tip. The RAG1-RAG2 structure
rationalizes more than 60 mutations identified in immunodeficient patients, as well as a large body of genetic and
biochemical data. The architectural similarity between RAG1 and the hairpin-forming transposases Hermes and Tn5
suggests the evolutionary conservation of these DNA rearrangements.


To combat the great range of possible infectious agents, the vertebrate
immune system deploys a highly diverse population of immunoglob-
ulins and T-cell receptors. In many species this diversity is generated
by V(D)J recombination. By combinatorial joining of segments of
coding sequence, V(D)J recombination is capable of assembling mil-
lions of different functional immunoglobulin and T-cell receptor genes.
This recombination is initiated by DNA double-strand breaks produced
by the RAG1-RAG2 recombinase, at sites flanked by specific recom-
bination signal sequences (RSS). The RSS are of two types, with either
12 or 23 non-conserved nucleotides between conserved heptamer and
nonamer modules; one RSS of each type is strictly required for recom-
bination. The two RSS varieties are partitioned so as to focus recom-
bination on V to J, or V to D to J, joining. RAG1 and RAG2 are the only
lymphoid-specific factors involved in V(D)J recombination, while the
resulting hairpinned coding ends are processed by general repair factors
of the non-homologous end-joining pathway.

Since the identification of the RAG1 and RAG2 genes, RSS-
dependent DNA cleavage by purified RAG1-RAG2 has been recon-
stituted. RAG1 and RAG2, of 1,040 and 527 residues respectively,
cooperate in all their known activities. The catalytic core, regulatory
regions, active site residues, DNA-binding domains, two zinc-binding
motifs, and some aspects of the interface of RAG1 and RAG2 have been
characterized. It was also found that RAG1-RAG2 can function
in vitro as a transposase, inserting RSS-terminated DNA into a sec-
ond DNA molecule. Moreover, a large number of human mutations in
both RAG proteins that cause severe combined immunodeficiency (SCID)
or a milder form known as Omenn syndrome have been identified.

Biochemical and functional studies have shown that portions of RAG1
and RAG2 can be deleted, and the ‘core’ proteins, residues 384-1008 of
RAG1 and 1-387 of RAG2, retain targeted cleavage activity in vitro and
recombination activity (although not fully regulated) in cells. An
earlier low-resolution electron microscopic study of the core complex,
containing two subunits each of RAG1 and RAG2 bound to a 12RSS
and 23RSS DNA pair, revealed the overall shape and localization of
RAG proteins. Here we report the 3.2 Å crystal structure of the RAG1-
RAG2 heterotetramer and its implications for V(D)J recombination.


SEC complex and structure determination 

The catalytic cores of mouse RAG1 (384-1008 amino acids) and RAG2 
(1-387 amino acids) with maltose binding protein (MBP) fused to their 
N termini were expressed in HEK293T cells and readily purified (Me- 
thods). RAG1-RAG2 was assembled with pre-cleaved 12RSS and 23RSS 
DNAs in the presence of HMGB1 to form a signal-end complex (SEC),
and the purified SEC after removal of the cleaved MBP tags and HMGB1 
(Fig. 1a, b) was homogeneous and active in strand transfer (Extended 
Data Fig. 1). 

Crystals were grown over a period of 2-4 weeks (Methods). For phase
determination, methionines in the RAG1-RAG2 proteins were substi-
tuted by selenomethionine to a level of 40% (Methods). Single-wavelength
anomalous diffraction (SAD) data sets of high redundancy were col-
lected at the Se absorption peak from six crystals. Fifty-four of fifty-
eight selenium sites were located, together with two Zn2+ atoms (one in
each RAG1). The electron density map using all SAD data, nominally
at 3.7 Å, was superior to that calculated using only the two best sets ac-
cording to anomalous correlation coefficient (Fig. 1c, d and Extended
Data Fig. 1a). The heterotetramer of RAG1-RAG2 recombinase was
readily traceable (Extended Data Fig. 1b). Although 12RSS and 23RSS
DNAs were included in the SEC and were also present in dis-
solved crystals (Extended Data Fig. 1c, d), DNA was not found in the
electron density map. Only the four protein chains, with residues 391-
1008 of RAG1 and 2-350 of RAG2, were modelled and refined to 3.2 Å
(Extended Data Table 1). The carboxy-terminal 37 residues of RAG2
are disordered. In fact, RAG2 (1-351) forms active heterotetramers
with RAG1 in vitro (Extended Data Fig. 2), and supports V(D)J recom-
bination in cells.


Architecture of RAG1-RAG2


The RAG1-RAG2 crystal structure is remarkably similar to the low-
resolution model generated from two-dimensional averaging of nega-
tively stained electron microscopy images. It is Y-shaped (125 Å ×
150 Å × 90 Å), with the RAG1 dimer forming the bulk and RAG2 situ-
ated at the tip of each arm (Fig. 2). There is an oval-shaped gap (40 Å ×
60 Å) separating the two RAG1-RAG2 heterodimers (Fig. 2). RAG1 is


1Laboratory of Molecular Biology, NIDDK, NIH, Bethesda, Maryland 20892, USA. †Present address: Department of Cell and Molecular Biology, Karolinska Institute, 171 77 Stockholm, Sweden, and Centre for Structural Systems Biology, DESY, 22607 Hamburg, Germany.

Figure 1 | Structure determination of RAG1-RAG2 recombinase.

a, Procedure of assembling SEC from purified RAG1-RAG2, RSS DNAs and 
HMGB1. The MBP tags were cleaved off by PreScission protease after SEC
formation. b, The RAG1-RAG2-DNA complex was purified away from MBP 
tags, free DNA and HMGB1 on a Superdex-200 column. The eluted peak 
contains RAG1-RAG2 protein and RSS DNAs, as shown in the protein and 
DNA denaturing gels stained by Coomassie blue and SYBR green, respectively. 
c, d, The experimental electron density map calculated from merging two best 
SAD data sets (c) or all six SAD data sets (d). 


elongated (100 Å) and composed of seven structural modules (Fig. 3a).
The N-terminal nonamer-binding domain (NBD), which superimposes
well with the structure determined previously, forms a domain-swapped
and intertwined dimer. The following dimerization and DNA binding
domain (DDBD) is connected to the NBD by a flexible linker, and the
last helix of the three-helix C-terminal domain (CTD) folds back to
complete the DDBD (Fig. 3b), which may be why previous domain dis-
sections failed to isolate this structural entity. Three conserved carbox-
ylates in RAG1 (D600, D708 and E962) have been identified as being
essential for catalysis, but it was not clear how the catalytic domain
would fold owing to their large separation in the amino-acid sequence.
Our structure shows that RAG1 adopts an RNase H fold with an elon-
gated central four-stranded β-sheet, similar to other DDE transposases
(named after the three catalytic carboxylates, which are all within the



Figure 2 | Crystal structure of RAG1-RAG2. a, b, Front (a) and top (b) view
of the RAG1-RAG2 heterotetramer. The two RAG1 chains are shown in
blue and green ribbon diagrams, and both RAG2 subunits are shown in 
magenta. The active sites are highlighted by the three carboxylates shown as red 

same RAG1 subunit) (Fig. 3). Following the DDBD, the extended
pre-RNase H (preR) and catalytic RNase H (RNH) domains further
lengthen RAG1 to ~100 Å. Each active site is located in the middle of a
Y arm, and the two are about 45 Å apart (Fig. 2). Two domains inter-
vene between D708 and E962: ZnC2, which protrudes towards RAG2,
and the highly helical ZnH2, which increases the third dimension of RAG1
from 25 Å to 65 Å and eventually brings E962 back to the catalytic cen-
tre (Fig. 3a-c). Two regions for Zn2+ binding were previously iden-
tified. The first (C727, C730) and second (H937 and H942), despite
being far apart in the sequence, form one zinc-binding site (Fig. 3c, d)
that juxtaposes the catalytic centre, interface with RAG2 (ZnC2), and
DNA binding (DDBD and ZnH2) domains.

RAG2 is folded into a six-bladed β-propeller, or Kelch-repeat struc-
ture, as predicted (Extended Data Fig. 3). The first N-terminal
β-strand belongs to the sixth 4-stranded β-blade and ties the first and
last Kelch repeats together. Compared with other β-propeller struc-
tures, the six blades of the doughnut-shaped RAG2 are more distorted
in planarity and spacing between blades (Extended Data Fig. 3). These
distortions are most notable at the interface with RAG1 (Fig. 4a). As
usual for β-propeller proteins, one face of RAG2 has extended loops.
Many loops, particularly those connecting adjacent blades or the mid-
dle two strands of each blade, are involved in interacting with the preR,
RNH and ZnC2 domains of RAG1 (Fig. 4b, c). The interface between
RAG1 and RAG2 is highly conserved from fish to humans, encompass-
ing both polar and hydrophobic interactions and dovetailed with ridges
and canyons (Fig. 4). Interestingly, RAG2 contacts RAG1 near the active
site, including E607 and V615 (connected by a disordered loop) and
E719 to V724 contacted by the long RAG2 loop 335-339 and residue
R39, respectively (Figs 2 and 4c). It is likely that in the presence of DNA
substrate, RAG2 assists in formation of the catalytic site and DNA
cleavage.


SCID and Omenn syndrome mutations 


Over 60 missense mutations leading to SCID or Omenn syndrome, a 
milder type of immunodeficiency, due to defective DNA processing in 
V(D)J recombination have been mapped to the catalytic cores of RAG1 
and RAG2 (refs 12, 13). The disease-involved residues are identical 
between human and mouse RAG proteins, except for M435 in human 
RAG1 being replaced by L432 in mouse. Residue numbers in mouse and
human RAG1 differ by three (Extended Data Table 2). For consistency,
all residues are numbered here according to the mouse protein that we 
are studying. The SCID and Omenn syndrome mutations fall into four 
classes. The first class of mutations clearly destabilizes the tertiary struc-
ture of RAG1-RAG2. For example, mutations of the zinc-binding site,
C727E and the adjacent L729F (Fig. 3d), would perturb the structure 


sticks. The zinc ions are shown as dark red spheres. The distance between the 
two active sites is ~45 Å (marked by the red double arrowheads). The
disordered loop of residues 608-614 in RAG1 near the RAG1-RAG2 interface
is marked by a dotted line and grey arrowhead. 

Figure 4 | The interface between RAG1 and RAG2. a, One side of the
doughnut-shaped RAG2 interacts with the preR, RNH and ZnC2 domains of
RAG1 (colour-coded as in Fig. 3). The SCID/Omenn syndrome mutations
in RAG2 are shown and labelled in black. b, An orthogonal view of the RAG1-
RAG2 interface. The Kelch repeats of RAG2 are labelled I to VI. c, Based on
b, the interface of RAG1 and RAG2 is shown as an open book. The regions 
at the interface are indicated according to the colour code, and SCID/Omenn 
syndrome mutations are labelled in black. The mirrored arrowheads and 
boxes indicate matching surfaces. 




Figure 3 | The RAG1 structure. a, A diagram of
RAG1 domains with boundaries indicated by
residue numbers. b, Cartoons of RAG1 dimer. One
subunit is colour-coded as in a, and the other in
silver except for the preR and ZnC2 domains.
c, An orthogonal view of b. d, The Zn2+
coordination by two Cys (of ZnC2) and two His
residues (of ZnH2). e, The DDBD and CTD 
domains. f, The RNH and a portion of preR 
domain with the catalytic carboxylates shown in 
red sticks. In panels b, d-f, SCID/Omenn 
syndrome mutations are shown as coloured 
sticks with black labels. 




of the ZnC2, ZnH2 and adjoining domains. Three immunodeficiency
mutations (W893R, Y909C and I953R) probably destabilize the hy-
drophobic core next to the zinc-binding site (Fig. 3d), reflecting the
importance of this region in the overall structure of RAG1-RAG2. A
number of laboratory-generated mutations with loss of function are
also likely to destabilize the structure of certain domains in RAG1, for
example W893A.
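
Because all residues here follow mouse numbering, comparing them with human mutation tables requires a shift in position. The sketch below assumes a constant offset of three across the RAG1 catalytic core, inferred only from the single pair stated above (mouse L432, human M435); Extended Data Table 2 remains the authoritative mapping. The function name is hypothetical, and the wild-type residue letter is carried over unchanged, which is not valid for the one position where the two species differ.

```python
import re

# Assumption: human RAG1 position = mouse position + 3 throughout the core,
# inferred from the single stated pair (mouse L432 corresponds to human M435).
MOUSE_TO_HUMAN_RAG1_OFFSET = 3

# Labels such as "W893R" or "R696Q/W": wild type, position, substitution(s).
_MUTATION = re.compile(r"^([A-Z])(\d+)([A-Z](?:/[A-Z])*)$")


def mouse_to_human_rag1(mutation: str) -> str:
    """Shift a mouse-numbered RAG1 mutation label to human numbering."""
    match = _MUTATION.match(mutation)
    if match is None:
        raise ValueError(f"unrecognized mutation label: {mutation!r}")
    wild_type, position, substitutions = match.groups()
    shifted = int(position) + MOUSE_TO_HUMAN_RAG1_OFFSET
    return f"{wild_type}{shifted}{substitutions}"


if __name__ == "__main__":
    for label in ("W893R", "Y909C", "R696Q/W"):
        print(label, "->", mouse_to_human_rag1(label))
```

The offset applies to RAG1 only; no numbering difference is stated here for RAG2.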

The second class includes polar residues exposed to solvent and con-
centrated in two areas likely to bind DNA. Seven out of nine im-
munodeficiency mutations found in the DDBD and CTD domains are
at a cluster of polar residues (Arg, Lys, Ser or Gln) (Fig. 3e). Systematic
mutation of positively charged residues in RAG1 (ref. 34) also showed
that K966, R969, R977 and H990 in CTD contribute to DNA binding
and cleavage. In parallel, 12 SCID/Omenn syndrome mutations in the
NBD domain (Fig. 3b, Extended Data Fig. 4), which overlap with many
highly conserved non-polar and polar residues, are involved in struc-
tural integrity and sequence-specific binding to the nonamer DNA.

The third class of mutations is clustered around the active site (Fig. 3f). 
Some may alter the structure of the catalytic centre (S598P, C599W, 
A619P, R696Q/W, G706D), and others may change its DNA-binding 
properties (E666G, R621C/H, R713W). It is not surprising that engi-
neered mutations of conserved polar residues surrounding the active site
(E597, E709, D792, H795 and E959) lead to defects in DNA cleavage.

The last class of SCID/Omenn syndrome mutations is located at the
interface of RAG1 and RAG2. Four of six disease mutations in the RAG2
core are concentrated at the subunit interface (Fig. 4). G35V, R39G and
C41W are at the interface with the ZnC2 and RNH domains of RAG1.
R229 forms salt bridges with D546 (preR) of RAG1, which has been
accurately mapped to the interface with RAG2 (ref. 35). G95R is at the
base of a long loop that reaches to the RNH domain and sandwiches
E666 (which itself is mutated in people with SCID) with G35 (Fig. 4a).
The importance of the subunit interface is also evident in the disease
mutations located on the RAG1 side (R556S, R558C/H, E666G and
R773Q) and additional mutations identified in laboratories (E719Q and
R773A) (Extended Data Table 3).

A model of the RAG1-RAG2-RSS DNA complex


Structures of five DDE transposases complexed with DNA have been
reported. They can be segregated into two groups based on the
markedly different DNA orientations relative to the catalytic dimer
(Extended Data Figs 5 and 6). Hermes (a member of the hAT trans-
posase family) and Tn5, both of which generate a hairpin intermediate
during DNA processing, belong to the same group, and both contain
an α-helix extending from one catalytic centre (by contributing the last
catalytic residue Glu) to contact the DNA bound to the other subunit
(Extended Data Fig. 5). Very similar architectural features are found in
the RAG1 dimer. After superimposition of the RNH domain of RAG1
and Hermes (Fig. 5a), not only does the 16-base-pair (bp) DNA co-
crystallized with Hermes fit into the active site of RAG1, but the second
DNA of Hermes is close to the other RAG1 catalytic centre (Fig. 5b). The
first 7 bp of the DNA may mimic the heptamer of each RSS (Fig. 5b).
Remarkably, the two DNAs modelled in RAG1 are connected by the
α-helix (964-975 amino acids) that immediately follows the catalytic
residue E962, just as is found in Hermes and Tn5. None of the aromatic
residues previously suggested as functioning in DNA hairpin forma-
tion is situated near the active site. G851 and N852 in the ZnH2
domain of RAG1 appear to replace W319 in the equivalent helical inser-
tion domain of Hermes that stacks on the 3'-end DNA base (Fig. 5b).
In corroboration of this model, mutations of N852 or residues 970-978
in the CTD are implicated in Omenn syndrome (Figs 3e and 5b).

The rest of the RSS DNA can also be modelled based on the pub-
lished crystal structure of an isolated NBD bound to a 12-bp nonamer
DNA (Extended Data Fig. 4). From DDBD to CTD, the two halves of
the RAG1-RAG2 tetramer are rather symmetric (Fig. 5c), but the NBD
domains are not related by the same dyad. Although perfectly sym-
metrical internally, the intertwined NBDs are tilted relative to the rest
of the protein. As a result, the two nonamer DNAs bound to the NBDs

Figure 5 | A RAG1-RAG2-DNA model. a, Superposition of the RNH
domains of RAG1 (red active site) and Hermes (Protein Data Bank (PDB)
4D1Q (ref. 37)). b, The superposition places the 16-bp DNA of the Hermes-
DNA complex in the RAG1 active site. The DNA is coloured yellow (the
first 7 bp) and gold. The second DNA is shown without additional
manipulation. The α-helices bridging the two DNAs are coloured lilac. c, The
RAG1-RAG2-RSS DNA model resulting from superposition with the Hermes
and the NBD-DNA complex (12 bp, PDB 3GNA). d, DNA cleavage by
hairpin-forming bacterial and eukaryotic transposases. The recognition 
sequences are represented by orange triangles. 

would be oriented differently to the two catalytic centres. One is very
close to the 16-bp DNA modelled into the RNase H domain (Fig. 5c).
Notably, the sum of the two DNA segments is about 28 bp, the total
length of 12RSS DNA. In contrast, the other pair of DNA segments is
separated by ~30 Å, which may mimic the 23RSS with its additional
11 bp connecting the nonamer and heptamer (see Supplementary Video 1).
In this model, the nonamer and heptamer ends of each RSS interact
with different RAG1 subunits in a trans configuration, as mutational
study has suggested. A sharp kink is unavoidable in each RSS DNA as
modelled (Fig. 5c), and HMGB1 could stabilize such kinks to facilitate
the gene rearrangement.
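
The length bookkeeping behind this comparison can be written out explicitly. The sketch below uses only the heptamer, spacer and nonamer sizes given in the text; the rise of 3.4 Å per base pair is the textbook value for straight B-form DNA and is an assumption introduced here, not a measurement from the structure. All function names are illustrative.

```python
HEPTAMER_BP = 7
NONAMER_BP = 9
ALLOWED_SPACERS_BP = (12, 23)
B_DNA_RISE_ANGSTROM = 3.4  # assumption: ideal, unbent B-form duplex


def rss_length_bp(spacer_bp: int) -> int:
    """Total RSS length: heptamer + spacer + nonamer."""
    if spacer_bp not in ALLOWED_SPACERS_BP:
        raise ValueError("RSS spacers are 12 or 23 nucleotides long")
    return HEPTAMER_BP + spacer_bp + NONAMER_BP


def obeys_12_23_rule(spacer_a: int, spacer_b: int) -> bool:
    """Recombination requires one RSS of each spacer type."""
    return sorted((spacer_a, spacer_b)) == list(ALLOWED_SPACERS_BP)


def straight_duplex_length_angstrom(n_bp: int) -> float:
    """Length of n base pairs of straight B-form DNA."""
    return n_bp * B_DNA_RISE_ANGSTROM


if __name__ == "__main__":
    extra_bp = rss_length_bp(23) - rss_length_bp(12)
    print("12RSS length (bp):", rss_length_bp(12))
    print("23RSS length (bp):", rss_length_bp(23))
    print("extra spacer (bp):", extra_bp)
    print("extra spacer as straight B-DNA (Å):",
          round(straight_duplex_length_angstrom(extra_bp), 1))
```

Under these assumptions the 12RSS comes to 28 bp, matching the sum of the 16-bp and 12-bp modelled segments, and the 11 additional base pairs of the 23RSS would span a distance of the same order as the ~30 Å gap between the other pair of segments.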

The surface of RAG1-RAG2 traversed by the modelled DNAs is both 
highly positively charged and highly conserved (Extended Data Fig. 7). 
The only exception is the NBD, which is not as highly charged as the 
DDBD and ZnH2 regions. This may correlate with the sequence- 
specific recognition of the nonamer. Beyond the regions that are mod- 
elled to bind RSS DNAs, extensive surface areas along the rim of the Y 
arms from RAG2 to ZnH2 are positively charged and partially con- 
served (Supplementary Video 1). These areas could bind up to 20 bp of 
coding DNA flanking the RSS. Although 6-bp coding flanks can be 
slowly cleaved by RAG1-RAG2, efficient cleavage requires more than 
15 bp of flanking DNA. Interactions of the coding flanks with the top
of the RAG1-RAG2 complex may explain why many mutations in the
RAG1 and RAG2 interface have an impact on DNA cleavage.


Concluding remarks 


The structure of RAG1-RAG2 reveals the architecture of the complex
and the composition of its functional sites. It rationalizes the effects of 
many mutations associated with human immune deficiencies. Evolu- 
tionarily, eukaryotic hAT transposases and RAG1-RAG2 recombinase, 
which cleave DNA duplex in two steps and leave a hairpin on the flank- 
ing DNA, are thought to be rather different from the bacterial transpo- 
sases that cleave DNA in three steps and form a hairpin intermediate 
on the recognition DNA (Fig. 5d). The similar enzyme-substrate asso- 
ciation found in Hermes and Tn5 and their structural relationship with 
RAG1 has led us to propose that the two DNA recombination processes
are identical in mechanism and configuration, differing only in the 
nucleophiles used at each step, a water molecule versus a 3'-OH. 


Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper. 

METHODS


No statistical methods were used to predetermine sample size. 

Protein expression and purification. Mouse core RAG1 (384-1008 amino acids)
and RAG2 (1-387 amino acids) were cloned into the pLEXm-based mammalian
expression vector, modified with an N-terminal His6 tag followed by an MBP tag and
a PreScission cleavage site (LEVLFQ/GP) (where '/' indicates the cleavage site). The
N-terminal Met of RAG2 was mutated to Val (M1V) during cloning. To express the
RAG1-RAG2 complex in HEK293T cells, 500 μg of each of the RAG1 and RAG2
expression plasmids were mixed with 4 mg of polyethylenimine (Polysciences) in
35 ml of Hybridoma medium (Invitrogen) to transfect 1 l of HEK293T cells grown
in suspension culture in Freestyle 293 medium (Invitrogen), supplemented with 1%
FBS when the cell density reached 1.5 million per millilitre. Four days after
transfection, cells were harvested and stored at −80 °C. Cell paste (~8 g from 1 l culture)
was re-suspended in 50 ml of lysis buffer containing 20 mM HEPES (pH 7.3), 1 M
KCl, 1 mM Tris (2-carboxyethyl) phosphine (TCEP) (pH 7.0), 1 mM EDTA and
protease inhibitor cocktail (Roche), and lysed by sonication. After centrifuging at
35,000 r.p.m. for 1 h, the clarified lysate was mixed with 5 ml of amylose resin (NEB),
which had been pre-equilibrated with the lysis buffer, and incubated with rotation for
1 h. After pouring the resin into a column and washing thoroughly with 200 resin
volumes of the lysis buffer, the RAG1-RAG2 protein was eluted by an amylose
elution buffer containing 20 mM HEPES (pH 7.3), 500 mM KCl, 40 mM maltose
and 1 mM TCEP. The eluted protein, which consisted mainly of RAG1-RAG2
heterotetramer, was concentrated and stored at −80 °C after adding glycerol to
20% final concentration. In contrast to RAG1 and RAG2 core proteins expressed
in insect cells, which required activity-based purification for structural studies,
RAG1-RAG2 expressed in human cells is highly active. Human HMGB1 (1-163
amino acids) was prepared as reported previously.

To make the SEC complex, purified RAG1, RAG2, 12RSS and 23RSS DNAs were
mixed at a 2:2:1:1 molar ratio in the presence of a 2-fold excess of HMGB1 in a
buffer of 20 mM HEPES (pH 7.3), 150 mM KCl, 5 mM CaCl₂ and 1 mM TCEP and
incubated for 1 h at 37 °C. After removing the MBP tag by addition of PreScission
protease (1:100 mass ratio of protease to substrate) overnight at 4 °C, further
purification was performed using gel filtration (Superdex 200, GE Healthcare) in 20 mM
HEPES (pH 7.3), 500 mM KCl, 5 mM CaCl₂ and 1 mM TCEP, which removed
HMGB1 along with free DNA, the MBP tag and PreScission protease. All
purification steps were performed at 4 °C. HMGB1 could be retained with RAG1-RAG2
and RSS DNA if the KCl concentration was reduced to below 100 mM (Extended Data
Fig. 2), but we were unable to crystallize SEC complexed with HMGB1. To prepare
selenomethionine (SeMet)-labelled RAG1-RAG2 complex, HEK293T cells were
transferred after transfection to methionine-free Freestyle 293 medium
(Invitrogen) supplemented with 25 mg l⁻¹ L-SeMet (Acros Organics) and 1% dialysed FBS
(Invitrogen). Three days later, cells were collected and protein was purified in the
same way as native protein. Mass spectrometry analysis of trypsin-digested
SeMet-labelled RAG1-RAG2 peptides was performed at the Taplin Mass Spectrometry
Facility (taplin.med.harvard.edu). It showed that about 40% of methionines were
substituted by SeMet.

Crystallization and data collection. Crystals of the RAG1-RAG2-DNA complexes
were grown by the hanging-drop vapour diffusion method at 4 °C over 3 weeks.
Equal volumes of protein (~5 mg ml⁻¹) and reservoir solution containing 100 mM
MES (pH 7.1), 10-15% PEG 3350, 200 mM tribasic ammonium citrate (pH 7.0) and
100 mM KCl were mixed in each droplet. Crystals were cryo-protected in reservoir
solution supplemented with 25% ethylene glycol and flash frozen in liquid nitrogen.
We were able to crystallize tetrameric RAG1-RAG2 alone as well as RAG1-RAG2
with a single 12RSS or 23RSS, but these crystals were small, and none of them
diffracted X-rays as well as the SEC complex. Crystals of SeMet-labelled
RAG1-RAG2 complex were grown under similar conditions. Native and SeMet-labelled
complex both crystallized in the C222₁ space group with two RAG1 and two RAG2
(one RAG1/2 heterotetramer) in each asymmetric unit. Data were collected at 100 K
for native and SeMet-derivative crystals at beam lines 22ID and 23ID of the Advanced
Photon Source (APS) at Argonne National Laboratory. All data were indexed,
integrated and scaled with the XDS package (Extended Data Table 1).

Structure determination and refinement. Phases were determined by the
single-wavelength anomalous diffraction (SAD) method and multi-crystal averaging.
SAD data were collected from six SeMet-substituted crystals with the best
resolution of 3.7 Å (Extended Data Table 1). Data were processed according to the
published procedure (Extended Data Fig. 1a). Selenium sites were identified using
SHELXD and refined with PHASER. Out of the 58 highest anomalous peaks,
54 corresponded to selenium sites, and two were Zn²⁺ ions. Phases were improved
by density modification using RESOLVE and the overall figure-of-merit was 0.79.
The RAG1-RAG2 model was built manually in COOT. Although the experimental
electron density map contained breaks in the main chains, and side-chain
definition was not perfect, the register of the polypeptide chains was readily determined
based on the SeMet sites. This initial model was refined in Phenix and manually
improved using COOT. Secondary structure restraints and non-crystallographic
twofold symmetry averaging restraints were used throughout the refinement. The
RAG1-RAG2 structure was refined to 3.2 Å with Rwork and Rfree of 20.6% and 25.9%,
respectively (Extended Data Table 1). The quality of the structure was validated with
MolProbity. 90.7% of residues are in the favoured regions of the Ramachandran
plot, 9.3% in additional allowed regions, and no residue in the disallowed region.
The final model contains amino acids 391-1008 of RAG1 and amino acids 2-350
of RAG2, and one Zn²⁺ ion in each RAG1. The N-terminal residues (391-404) are
ordered in one RAG1 subunit. Owing to poor electron densities, residues 608-616
of RAG1 and 82-88, 242-244, 254-255 and 334-337 of RAG2 were not included
in the final model. Crystal packing of two neighbouring RAG1-RAG2 tetramers
appears to occlude one nonamer-binding site in each RAG1-RAG2 complex
(Extended Data Fig. 1e). No water molecules were added. All structure figures were
prepared with PyMOL (http://www.pymol.org), and sequence conservation
analysis was performed using ClustalW.

Extended Data Figure 1 | Structure determination. a, The plot of correlation
coefficient (CC) of anomalous signal versus resolution. The red line indicates
the cutoff of CC = 0.3. Merging data from the two best crystals produced a
better CC than merging data from all six crystals. The data processing
procedure is outlined above the plot. b, The SAD experimental map
contoured at 1.3σ showed the content of an asymmetric unit. The Se
anomalous map is contoured at 3.0σ in red. c, A typical crystal of
RAG1-RAG2. d, The content of crystals was examined by protein and DNA
denaturing gels after a thorough wash of the crystals and stained by Coomassie 
blue and SYBR green. To confirm the 1:1 molar ratio of 12 and 23RSS DNA,
³²P-labelled input RSS DNAs and those in SEC complexes before and after
crystallization are shown beneath the SYBR-green-stained DNA gel. 

e, Transposition assay of the purified SEC (RAG1-RAG2-12/23RSS DNA
complex) used for crystallization. Supercoiled pUC19 (sc, with a small amount
of open circle, oc) was the target; it was linearized by HindIII as a control.
The SEC (0.25, 0.5 and 1.0 μM) was active in concerted transposition and thus
in linearizing pUC19. In contrast, RAG1-RAG2 or HMGB1 (0.5 μM) each alone
was not active. f, Crystal packing of neighbouring RAG1-RAG2 complexes
(shown in dark and light colours) occludes one nonamer-binding site in
each heterotetramer of RAG1-RAG2.

Extended Data Figure 2 | RAG2 core fragment (1-351 amino acids) is
active. a, Sequence alignment of RAG2 from mouse (320-387 amino acids),
human, rat and Xenopus with predicted secondary structures shown above.
b, Core RAG2 (1-387) and two further truncated RAG2 variants (1-351 and
1-367) were constructed with a non-cleavable N-terminal MBP tag and
co-expressed with the tag-less core RAG1. The Coomassie blue R-250 stained
SDS gel shows the purified RAG1-RAG2 complexes. c, Purified RAG1-RAG2
complexes with truncated RAG2 variants are equally active in cleaving
a ³²P-labelled 12RSS DNA (in the presence of a 23RSS and Mg²⁺, as examined
by TBE-Urea gel). d, Elution profiles of RAG1-RAG2 (both long and short
forms) complexed with DNA from Superdex-200 (S200) in a low salt
buffer (50 mM HEPES pH 7.0, 60 mM KCl, 1 mM maltose and 2 mM DTT).
Regardless of the length of RAG2, the major S200 eluant peak came out at
the same time point and contained RAG1, RAG2 (1-351 or 1-387) and
HMGB1 proteins, as shown in the SDS gel (right insert), as well as 12 and 23RSS
oligonucleotides, as confirmed by a TBE-Urea gel stained by SYBR green
(left insert).

Extended Data Figure 3 | Comparison of RAG2 with β-propeller and
β-pinwheel structures. KLHL2 (PDB 4CHB) is selected to represent the
β-propeller proteins, and the C-terminal domain (CTD) of GyrA (PDB
1SUU) is selected to represent the β-pinwheel structures. After superposition,
RAG2 (a), KLHL2 (b) and GyrA CTD (c) are shown side-by-side individually in two
orthogonal views. Each structure is coloured from N to C terminus in blue
to red rainbow colours. The loops in RAG2 that interact with RAG1 are
labelled. The six β-blades are named by Roman numerals, I-VI, from N to C
terminus; four β-strands in each blade are named by Arabic numerals, 1-4.

Extended Data Figure 4 | Comparison of RAG1 and NBD-DNA complex.
a, The NBD in the RAG1-RAG2 core complex (blue and green) superimposes
well with the published structure of the NBD-DNA complex (PDB 3GNA,
protein coloured yellow). b, The twelve SCID/Omenn syndrome mutations in
the NBD domain are mapped onto the crystal structure of the NBD-DNA
complex. Six SCID/Omenn syndrome (R391 to R407) mutations are located on
a positively charged surface patch that interacts with the nonamer; five
remaining SCID/Omenn syndrome mutations (L408 to A441) appear to
affect the structural integrity of the NBD, and R446 may interact with the spacer
DNA in each RSS.

Extended Data Figure 5 | Transposases that form a hairpin intermediate.
a-c, Hermes (PDB 4D1Q; ref. 37) (a), bacterial Tn5 (PDB code 1MUS) (b), and
RAG1 dimers (c) are shown as ribbon diagrams in two orthogonal views, with
the dyad perpendicular to the viewing plane (left) or in the plane (right). 
Each dimer consists of a cyan and a green subunit. The catalytic RNH domains 
are highlighted in pink, and the conserved catalytic residues are shown as 
red ball-and-sticks. The catalytic divalent metal ions are shown as green spheres 
if present. The DNAs, coloured in yellow (cleaved by the cyan subunit) and
orange (cleaved by the green subunit), have similar orientations in the Hermes
and Tn5 complexes (as indicated by the arrows). Arrows with dashed outlines 
indicate that the DNAs are in the back of the viewing plane. Notably, the 
pair of RNH domains is oriented similarly in all three cases. The predicted 
orientations of DNAs bound to RAG1 are indicated by the yellow and orange
arrows, and the α-helices connected to the third catalytic carboxylates
(shown in light purple) probably bridge two DNAs in RAG1 recombinase as in
Hermes and Tn5. 

Extended Data Figure 6 | Transposases that do not form a hairpin
intermediate. a-c, Retroviral integrase from Prototype foamy virus (Pfv, PDB
3OS0) (a), bacterial MuA transposase (PDB code 4FCY) (b) and eukaryotic
Mos1 mariner transposase (PDB 3HOT) (c) are shown in comparable
views and the same representations as Hermes, Tn5 and RAG1-RAG2 in Extended
Data Fig. 5. Each catalytic dimer consists of a cyan and a green subunit. Two 
accessory subunits in Pfv are shown in light blue and green, and two accessory 
subunits of the MuA structure are omitted for clarity. The catalytic RNH
domains are highlighted in pink. The DNAs, coloured in yellow (cleaved by the
cyan subunit) and orange (cleaved by the green subunit), have similar 
orientations (within 30°) as indicated by the arrowheads, but each differs 
more than 90° from the corresponding DNA in Hermes or Tn5 transposase. 
The grey DNA in the MuA complex represents the target of transposition. 
Among these three recombinases, the α-helix that follows the third catalytic
carboxylate (coloured in light purple) does not cross over to interact with a 
second DNA. 

Extended Data Figure 7 | Surface potential and conservation of RAG1-RAG2
complex. a, Orthogonal views of the electrostatic potential surface of the
RAG1-RAG2 structure. Blue indicates positive charges, and red negative.
b, Orthogonal views of the molecular surface of RAG1-RAG2 with absolutely
conserved residues highlighted in deep purple. The NBD is well conserved.
The views with dyad in the plane here are related to the image shown in Fig. 5c
by ~50° rotation around the dyad.

Extended Data Table 1 | Statistics of native and SeMet SAD data collection and structure refinement

                        Native               Crystal #1           Crystal #2           Crystal #3           Crystal #4           Crystal #5           Crystal #6
Space group             C222₁                C222₁                C222₁                C222₁                C222₁                C222₁                C222₁
Cell dimensions
  a, b, c (Å)           168.8, 180.1, 200.2  168.7, 179.0, 199.3  168.5, 179.2, 200.3  169.1, 180.1, 200.8  168.7, 180.3, 202.3  169.4, 179.3, 199.7  169.1, 179.6, 200.0
  α, β, γ (°)           90, 90, 90           90, 90, 90           90, 90, 90           90, 90, 90           90, 90, 90           90, 90, 90           90, 90, 90
Absorption (Se)                              Peak                 Peak                 Peak                 Peak                 Peak                 Peak
Wavelength (Å)          1.0000               0.97918              0.97918              0.97918              0.97918              0.97913              0.97918
Resolution* (Å)         50-3.2               50.0-3.8             50.0-3.7             50.0-4.0             50.0-3.9             50.0-3.8             50.0-3.9
                        (3.31-3.2)           (3.94-3.8)           (3.83-3.7)           (4.14-4.0)           (4.04-3.9)           (3.94-3.8)           (4.04-3.9)
Rmerge*                 0.105 (0.58)         0.151 (0.778)        0.149 (0.831)        0.199 (0.859)        0.194 (0.796)        0.174 (0.873)        0.200 (0.977)
I/σI*                   12.75 (2.23)         17.7 (4.6)           17.5 (3.9)           14.4 (4.1)           14.2 (4.5)           16.8 (4.5)           13.1 (3.8)
Completeness* (%)       98.82 (99.9)         100.0 (100.0)        100.0 (100.0)        100.0 (100.0)        100.0 (100.0)        100.0 (100.0)        100.0 (100.0)
Redundancy*             7.1.4)               15.0 (15.3)          15.0 (15.3)          14.8 (15.2)          14.9 (15.3)          15.0 (15.3)          15.0 (15.3)

Refinement
  Resolution (Å)        50-3.2
  No. reflections       49907
  Rwork / Rfree         0.206 / 0.259
  No. atoms
    Protein             14976
    Ligand/ion (Zn²⁺)   2
    Water               0
  B-factors
    Protein             106.4
    Ligand/ion          85.0
    Water               -
  R.m.s. deviations
    Bond lengths (Å)    0.007
    Bond angles (°)     1.114

Asterisk indicates that data for the highest-resolution shell are shown in parentheses.

Extended Data Table 2 | Missense mutations of RAG1 and RAG2 identified in human SCID/OS patients

Human mutation    Mouse residue   Predicted structural effects

RAG1
R394W/Q           R391A           Nonamer binding
R396L/H/C         R393A           Nonamer binding
S401P             S398            Nonamer binding
T403P             T400            Nonamer binding
R404W/A/Q         R401A           Nonamer binding
R410Q/W           R407A           Nonamer binding
L411P             L408            Structural integrity of NBD
D429G             D426            Structural integrity of NBD
V433M             V430            Structural integrity of NBD
M435V             L432            Structural integrity of NBD
A444V             A441            Structural integrity of NBD
R449K             R446            Probably DNA binding (spacer)
L454Q             L451            Structural integrity of RAG1 dimer
R474S/H/C         R471A           Structure & DNA binding in DDBD
S480G             S477            Structure & DNA binding in DDBD
L506F             L503            Structural integrity of DDBD
R507W             R504            Solvent exposed, DNA binding?
G516A             G513            Structural integrity of RAG1
W522C             W519            Structural integrity of preR
D539V             D536            Exposed, near RAG1/2 interface
R559S             R556            At the edge of RAG1/2 interface
R561H/C           R558            RAG1/2 interface (T169 of RAG2)
A565D             A562            Structural integrity of preR
S601P             S598            Structural integrity of active site
C602W             C599            Structural integrity of active site
H612R             H609L           Disordered, near RAG2
A622P             A619            Structural integrity of RNH
R624C/H           R621A           Active site, adjacent to D600
E669G             E666            RAG1/2 interface
R699Q/W           R696            Structural stability of RNH
G709D             G706            Structural integrity, active site
R716W             R713A           Structural integrity of RAG1/2
G720C *           G717            RAG1/2 interface
E722K             E719K           RAG1/2 interface
C730E/F           C727            Structural integrity of ZnC2
L732F/P           L729            Structural integrity of ZnC2
R737H             R734A           Possibly DNA binding (coding end)
H753L             H750A           Structural integrity of RAG1/2
R764P             R761            At the edge of RAG1/2 interface
E770K             E767            RAG1/2 interface
R776Q             R773A           RAG1/2 interface
R778Q/G/W         R775A           Structural integrity of RAG1/2
P786L             P783            RAG1/2 interface
I794T             I791            Structural integrity of ZnH2
R841W/Q           R838            DNA binding (near heptamer)
N855I             N852            Interacts with 3' end of RSS
L885R             L882            Structural integrity of ZnH2
W896R             W893A           Structural integrity of ZnH2
Y912C             Y909            Structural integrity of ZnH2
I956T             I953R           Structural integrity of ZnH2
R973H/C           R970            Heptamer binding (intra-subunit)
F974L             F971            Structural integrity of CTD
R975W/Q           R972            Structural integrity of CTD
Q981R/P           Q978            Heptamer binding (inter-subunit)
K992E/R           K989            Probable DNA binding
M1006V            M1003           Domain interface of CTD-DDBD

RAG2
G35V              G35             RAG1/2 interface (E666 of RAG1)
R39G              R39A            RAG1/2 interface (E719, R773)
C41W              C41             Structure, and RAG1/2 interface
G95R              G95             Structure integrity of RAG2
R229E/Q/W         R229            RAG1/2 interface (D546 of RAG1)
M285R             M285            Partially exposed, maybe structure

All SCID/Omenn syndrome mutations here are listed in refs 12 and 13 except for three, for which references are given in the table (refs 58-60). Red residues are the mutations made in mouse.
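
The human and mouse residue numbers in Extended Data Table 2 differ by a constant offset: every mouse RAG1 position listed is three lower than the human one (for example, human R394 and mouse R391), whereas the RAG2 rows carry the same number in both species. The Python sketch below encodes that observation as a lookup aid. The function name and the offset table are illustrative and reflect only the rows of this table, not a general alignment of the two proteins.

```python
# Offsets inferred from the rows of Extended Data Table 2 (an observation
# about this table only, not a full human-mouse sequence alignment).
HUMAN_TO_MOUSE_OFFSET = {"RAG1": -3, "RAG2": 0}


def mouse_position(protein: str, human_position: int) -> int:
    """Map a human residue number to the mouse number used in the table."""
    return human_position + HUMAN_TO_MOUSE_OFFSET[protein]


for protein, human_pos in [("RAG1", 394), ("RAG1", 1006), ("RAG2", 229)]:
    print(protein, human_pos, "->", mouse_position(protein, human_pos))
```

Note that the residue identity can still differ between species at a matched position, as in the M435V row, where the mouse residue is L432.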

Extended Data Table 3 | Mouse RAG1-RAG2 mutations presented in ref. 4

Mutations            Location and potential functional roles

RAG1 mutations: RAG2 binding +, all other -
K405A/H406A/R407A    Nonamer binding
R748A/H750A          R748 is near ZnC2 and ZnH2 and structurally important. H750 may stabilize the ZnC2 structure.
R773A/R775A          R773 is at RAG1/2 interface between E719 (RAG1) and R39 (RAG2). R775 is exposed to solvent.
H937A/K938A          H937 coordinates the zinc, and K938 forms a salt-bridge with E709 next to the catalytic D708.
H942A                Zinc coordination
R969A/R970A          Next to CTD in the positive groove for DNA binding

RAG1 mutations: RSS binding +, Nicking and Hairpinning -
K596A                H-bond to the carbonyl of A957 and A955, stabilizes the W956 conformation in the apo-structure
R621A/H              Next to D600 in the active site
R713A                H-bond to "O" of E719, Y725 and I726 (near RAG1/2 interface)
E719K                RAG1/2 interface
R734A                Solvent exposed, but could bind coding-end DNA
W760A                RAG1/2 interface
H795A                In the active site, next to D708
W956A                W956 is near E962 but facing exterior and separate from the active site by the protein backbone.

RAG1 mutations: Nicking +, Hairpinning -
K608A                Disordered
H609L                Disordered
R855A/K856A          K856 is oriented toward the CTD/DDBD, R855 is solvent exposed
R890A                Near D797 carboxylate, structure integrity, near R855
W893A                Structural integrity of ZnH2
K980A                On the CTD charged surface, potential for heptamer binding

RAG1 mutations: Joining negative or defective
R401A/R402A          Probable nonamer binding
E423Q                Forming a salt bridge with R407 that probably binds nonamer
R440A                Probable nonamer binding
E547Q                RAG1/2 interface
S723A/C              Adjacent to E719 (OS) and close to R39 of RAG2

RAG1 mutations: Gain-of-function (12RSS processing)
E649A                Solvent exposed and adjacent to N961 and S963 near E962.

RAG2 mutations: RAG1 binding +, all other -
K119A                Solvent exposed and part of the positive top rim of the "Y" shaped RAG1/2 complex
K283A/R              Near M285R (OS), stabilizing the loop of 306-315

RAG2 mutations: RSS binding +, Nicking and Hairpinning -
K38A/R39A            RAG1/2 interface, possible coding-end binding

RAG2 mutations: Joining negative or defective
K34A                 Near OS mutation G35V, RAG1/2 interface (adjacent to R73 and K97)
K56A/K58A            Solvent exposed, may interact with coding DNA flank
R73A                 RAG1/2 interface (adjacent to K34 and K97)
H94A                 H94 is a part of the RAG2 structural core and important for the correct folding
K97A                 Near OS mutation G95R at RAG1/2 interface (K34 and R73)
K119R                Exposed to solvent and adjacent to K56 and K58
R167A                Near RAG1 interface (N525 of RAG1), forming positive-π stack with W172 (near P674 of RAG1)
361 frame shift      Without PHD and regulatory domains

Mutations that correspond to SCID/Omenn syndrome mutations in human are highlighted in grey. R795A listed in ref. 4 (highlighted in red) should be H795A.

An ultraluminous quasar with a twelve-billion-solar-mass black hole at redshift 6.30

Xue-Bing Wu, Feige Wang, Xiaohui Fan, Weimin Yi, Wenwen Zuo, Fuyan Bian, Linhua Jiang, Ian D. McGreer, Ran Wang, Jinyi Yang, Qian Yang, David Thompson & Yuri Beletsky


So far, roughly 40 quasars with redshifts greater than z = 6 have been
discovered. Each quasar contains a black hole with a mass of about
one billion solar masses (10⁹ M⊙). The existence of such black
holes when the Universe was less than one billion years old presents
substantial challenges to theories of the formation and growth of black
holes and the coevolution of black holes and galaxies. Here we report
the discovery of an ultraluminous quasar, SDSS J010013.02+280225.8,
at redshift z = 6.30. It has an optical and near-infrared luminosity a
few times greater than those of previously known z > 6 quasars. On the
basis of the deep absorption trough on the blue side of the Lyman-α
emission line in the spectrum, we estimate the proper size of the ionized
proximity zone associated with the quasar to be about 26 million
light years, larger than found with other z > 6.1 quasars with lower
luminosities. We estimate (on the basis of a near-infrared spectrum)
that the black hole has a mass of ~1.2 × 10¹⁰ M⊙, which is consistent
with the 1.3 × 10¹⁰ M⊙ derived by assuming an Eddington-limited
accretion rate.

High-redshift quasars have been efficiently selected using a combination
of optical and near-infrared colours. We have carried out a
systematic survey of quasars at z > 5 using photometry from the Sloan
Digital Sky Survey (SDSS), the Two Micron All Sky Survey (2MASS)
and the Wide-field Infrared Survey Explorer (WISE), resulting in the
discovery of a significant population of luminous high-redshift quasars.
SDSS J010013.02+280225.8 (hereafter J0100+2802) was selected as a
high-redshift quasar candidate owing to its red optical colour (with
SDSS AB magnitudes i_AB = 20.84 ± 0.06 and z_AB = 18.33 ± 0.03) and
a photometric redshift of z ~ 6.3. It has bright detections in the 2MASS
J, H and Ks bands with Vega magnitudes of 17.00 ± 0.20, 15.98 ± 0.19
and 15.20 ± 0.16, respectively; it is also strongly detected in WISE, with
Vega magnitudes in W1 to W4 bands of 14.45 ± 0.03, 13.63 ± 0.03,
11.71 ± 0.21 and 8.98 ± 0.44, respectively (see Extended Data Figs 1
and 2 for images in different bands). Its colour in the two bluest WISE
bands, W1 and W2, clearly differentiates it from the bulk of stars in our
Galaxy. The object was within the SDSS-III imaging area. It is close to
the colour selection boundary of SDSS z ~ 6 quasars, but was assigned
to low priority earlier because of its relatively red z_AB − J colour and its
bright apparent magnitudes. It is undetected in both radio and X-ray
bands by the wide-area, shallow survey instruments.

Initial optical spectroscopy on J0100+2802 was carried out on 29
December 2013 with the Lijiang 2.4-m telescope in China. The
low-resolution spectrum clearly shows a sharp break at about 8,800 Å,
consistent with a quasar at a redshift beyond 6.2. Two subsequent optical
spectroscopic observations were conducted on 9 and 24 January 2014
respectively with the 6.5-m Multiple Mirror Telescope (MMT) and the
twin 8.4-m mirror Large Binocular Telescope (LBT) in the USA. The
Lyman-α (Lyα) line shown in the spectra confirms that J0100+2802 is
a quasar at a redshift of 6.30 ± 0.01 (see Fig. 1 and Methods for details).
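
The redshift estimates in this paragraph follow from the usual relation 1 + z = λ_obs/λ_rest. As a rough check, the Python sketch below applies it to the ~8,800 Å break quoted above, assuming the break sits at the rest-frame Lyα wavelength of 1,216 Å given in the Figure 1 legend. The function name is illustrative, and this is not the authors' line-fitting procedure.

```python
LYA_REST_ANGSTROM = 1216.0  # rest-frame Lyα wavelength quoted in the Fig. 1 legend


def redshift_from_line(observed_angstrom: float,
                       rest_angstrom: float = LYA_REST_ANGSTROM) -> float:
    """Return z from 1 + z = lambda_obs / lambda_rest."""
    return observed_angstrom / rest_angstrom - 1.0


# Break seen in the low-resolution Lijiang spectrum (assumed to mark Lyα).
z_break = redshift_from_line(8800.0)

# Observed wavelength at which Lyα should appear for the adopted z = 6.30.
lya_observed = LYA_REST_ANGSTROM * (1.0 + 6.30)

print(f"z from the 8,800 Å break: {z_break:.2f}")
print(f"Lyα at z = 6.30: {lya_observed:.0f} Å")
```

The break alone gives a redshift a little above 6.2, in line with the "beyond 6.2" statement, and Lyα at z = 6.30 falls just short of 8,900 Å, matching the Figure 1 legend.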


We use the multiwavelength photometry to estimate the optical luminosity
at rest-frame wavelength 3,000 Å (L_3000), which is consistent with
that obtained from K-band spectroscopy (see below). The latter gives a
more reliable value of (3.15 ± 0.47) × 10⁴⁷ erg s⁻¹, adopting a ΛCDM
cosmology with Hubble constant H_0 = 70 km s⁻¹ Mpc⁻¹, matter density
parameter Ω_M = 0.30 and dark energy density parameter Ω_Λ = 0.7.
Assuming an empirical conversion factor from the luminosity at 3,000 Å
to the bolometric luminosity, this gives L_bol = 5.15 × L_3000 = 1.62 ×
10⁴⁸ erg s⁻¹ = 4.29 × 10¹⁴ L⊙ (where L⊙ is the solar luminosity). We
obtain a similar result when estimating the bolometric luminosity from
the Galactic extinction corrected absolute magnitude at rest-frame
1,450 Å, which is M_1450,AB = −29.26 ± 0.20. The luminosity of this

Figure 1 | The optical spectra of J0100+2802. From top to bottom, spectra
taken with the Lijiang 2.4-m telescope, the MMT and the LBT (in red, blue
and black colours), respectively. For clarity, two spectra are offset upward by
one and two vertical units. Although the spectral resolution varies from very
low to medium, in all spectra the Lyα emission line, with a rest-frame
wavelength of 1,216 Å, is redshifted to around 8,900 Å, giving a redshift of 6.30.
J0100+2802 is a weak-line quasar with continuum luminosity about four times
higher than that of SDSS J1148+5251 (in green on the same flux scale),
which was previously the most luminous high-redshift quasar known at
z = 6.42.

quasar is roughly 4 times greater than that of the luminous z = 6.42
quasar SDSS J1148+5251, and 7 times greater than that of the most
distant known quasar ULAS J1120+0641 (z = 7.085); it is the most
luminous quasar known at z > 6 (see Extended Data Fig. 3).
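
The bolometric-luminosity arithmetic quoted above is easy to reproduce. The Python sketch below applies the factor of 5.15 to L_3000 and then divides by the Eddington relation of 1.3 × 10³⁸ erg s⁻¹ per solar mass that the text uses for its Eddington-limited mass. The solar-luminosity constant is an assumption (the IAU nominal value); the text does not say which value was adopted, so the result in solar units differs slightly from the quoted 4.29 × 10¹⁴ L⊙. Function and constant names are illustrative.

```python
L_3000 = 3.15e47           # erg/s, continuum luminosity at rest-frame 3,000 Å (from the text)
BOLOMETRIC_FACTOR = 5.15   # empirical conversion factor quoted in the text
L_EDD_PER_MSUN = 1.3e38    # erg/s per solar mass, Eddington relation quoted in the text
L_SUN = 3.828e33           # erg/s, IAU nominal solar luminosity (assumed, not from the text)


def bolometric_luminosity(l_3000: float) -> float:
    """Bolometric luminosity from the 3,000 Å continuum luminosity."""
    return BOLOMETRIC_FACTOR * l_3000


def eddington_mass(l_bol: float) -> float:
    """Black-hole mass in solar masses if L_bol equals the Eddington luminosity."""
    return l_bol / L_EDD_PER_MSUN


l_bol = bolometric_luminosity(L_3000)
m_edd = eddington_mass(l_bol)

print(f"L_bol = {l_bol:.2e} erg/s = {l_bol / L_SUN:.2e} L_sun")
print(f"Eddington-limited mass = {m_edd:.2e} M_sun")
```

With these inputs the luminosity comes out at about 1.62 × 10⁴⁸ erg s⁻¹ and the Eddington-limited mass at roughly 1.2-1.3 × 10¹⁰ M⊙, in line with the values quoted in the text.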

The rest-frame equivalent width of the Lyα + N V emission lines as
measured from the LBT spectrum is roughly 10 Å, suggesting that
J0100+2802 is probably a weak-line quasar (WLQ). The fraction of
WLQs is higher among the z ~ 6 quasars compared to those at lower
redshift, and a high detection rate of strong millimetre dust continuum
in z ~ 6 WLQs points to active star formation in these objects. Given
its extreme luminosity, J0100+2802 will be helpful in the study of the
evolutionary stage of WLQs by future (sub)millimetre observations,
though the origin of the weak ultraviolet emission line feature of WLQs
is still uncertain.

The LBT spectrum of J0100+2802 (Fig. 1) exhibits a deep Gunn-Peterson
absorption trough blueward of the Lyα emission. The transmission
spectrum (assuming an intrinsic power-law continuum of
F_λ ∝ λ^−1.5, where F_λ is the flux density at wavelength λ) is shown in
Fig. 2. Complete Gunn-Peterson absorption can also be seen in the Lyα,
Lyβ and Lyγ transitions. Statistically significant transmission peaks are
detected at z = 5.99 in both the Lyα and Lyβ troughs, and an additional
transmission peak is detected at z = 5.84 in the Lyβ trough. The 2σ lower
limit on the Lyα Gunn-Peterson optical depth (τ_α) at z = 6.00-6.15 is
τ_α > 5.5 and the 2σ lower limit for Lyβ is τ_β > 6, corresponding to an
equivalent τ_α > 13.5, following the conversion in the literature. The
characteristics of the intergalactic medium (IGM) transmission along the
line of sight of J0100+2802, including the deep Lyα and Lyβ troughs,
and the narrow, unresolved transmission peaks, are similar to those
observed in SDSS J1148+5251, and are consistent with the rapid increase

Figure 2 | Transmission in absorption troughs and the proximity zone for
J0100+2802. a, b, Transmission in Lyα and Lyβ absorption troughs
(respectively a, red; b, blue) was calculated by dividing the spectrum by a
power-law continuum, F_λ ∝ λ^−1.5. The shaded band in both panels shows
1σ standard deviation. The Lyα and Lyβ absorption redshifts are given by
λ/λ_Lyα(Lyβ) − 1, where λ_Lyα = 1,216 Å and λ_Lyβ = 1,026 Å. The optical spectrum
exhibits a deep Gunn-Peterson trough and a significant transmission peak at
z = 5.99. c, Transmission in the proximity zone. The proper proximity zone
for J0100+2802 (in black) extends to 7.9 ± 0.8 Mpc, a much larger value than
those of other z > 6.1 quasars, including 4.9 ± 0.6 Mpc for J1148+5251 (in
green), consistent with its higher ultraviolet luminosity. The transmission in 
c was calculated by dividing the measured spectrum by a power-law continuum 
F, x 4~** plus two Gaussian fittings of Ly and N v lines. The horizontal 
dotted line and the two dashed lines denote transmission values of 0, 0.1 and 1.0 
respectively, while the vertical dashed line denotes the proper proximity zone 
size of 0. 

in the IGM neutral fraction at z > 5.5 observed in a large sample of SDSS
quasars’®. The size evolution of the quasar proximity zone, which is
highly ionized by quasar ultraviolet photons, can also be used to constrain
the IGM neutral fraction. The size of the proximity zone is defined
by the point where the transmitted flux first drops by a significant
amount to below 10% (ignoring small absorption leaks) of the quasar
extrapolated continuum emission after the spectrum is smoothed to a
resolution of 20 Å (ref. 16). As shown in Fig. 2, J0100+2802 has a much
larger proper proximity zone (7.9 ± 0.8 Mpc; 1 Mpc is about 3.26 million
light years) than that of other SDSS quasars’*”° at z > 6.1; its large proximity
zone size is expected from the higher level of photo-ionization
dominated by quasar radiation.
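
The proximity-zone definition above can be turned into a short numerical sketch. The Python below is illustrative only: the function names, the boxcar smoothing and the flat ΛCDM parameters (H0 = 70 km s⁻¹ Mpc⁻¹, Ωm = 0.3) are assumptions made for this example and are not stated in this section; the only quantities taken from the text are the 10% threshold, the 20 Å smoothing and the Lyα rest wavelength of 1,216 Å.

```python
import numpy as np

LYA_REST = 1216.0        # Angstrom, Lyα rest wavelength quoted in the Fig. 2 caption
C_KM_S = 299792.458      # speed of light in km/s

def boxcar_smooth(wave, flux, width):
    """Boxcar-smooth `flux` to `width` Angstrom (assumes a uniform wavelength grid)."""
    step = wave[1] - wave[0]
    n = max(1, int(round(width / step)))
    return np.convolve(flux, np.ones(n) / n, mode="same")

def proper_distance_mpc(z_near, z_far, h0=70.0, omega_m=0.3, steps=2000):
    """Proper distance between two redshifts: integral of c dz / ((1 + z) H(z)).

    Flat ΛCDM with assumed (hypothetical) parameters h0 and omega_m.
    """
    z = np.linspace(z_near, z_far, steps)
    hz = h0 * np.sqrt(omega_m * (1 + z) ** 3 + (1 - omega_m))
    integrand = C_KM_S / ((1 + z) * hz)
    # Manual trapezoidal rule
    return float(np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(z)))

def proximity_zone_size(wave, transmission, z_quasar, threshold=0.1, smooth=20.0):
    """Proper size (Mpc) at which the smoothed transmission first drops below `threshold`."""
    smoothed = boxcar_smooth(wave, transmission, smooth)
    z_abs = wave / LYA_REST - 1.0
    # Walk blueward from the quasar redshift
    for i in np.where(z_abs <= z_quasar)[0][::-1]:
        if smoothed[i] < threshold:
            return proper_distance_mpc(z_abs[i], z_quasar)
    return None

# Toy spectrum: full transmission redward of 8,700 Angstrom, none blueward
wave = np.arange(8500.0, 9001.0, 1.0)
toy = np.where(wave >= 8700.0, 1.0, 0.0)
size = proximity_zone_size(wave, toy, z_quasar=6.30)
```

The toy spectrum is synthetic and is not the measured transmission of J0100+2802; with real data the continuum normalisation and the treatment of narrow transmission leaks would need more care than this sketch provides.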

We obtained the near-infrared J,H,K-band spectra with Gemini and
Magellan telescopes on 6 August and 7 October 2014, respectively (see
Methods for details). Figure 3 shows the combined optical/near-infrared
spectrum of J0100+2802 and the results of fitting the Mg II emission
line. The Mg II full-width at half-maximum (FWHM) is 5,130 ±
150 km s⁻¹, and the continuum luminosity at the rest-frame wavelength
of 3,000 Å is (3.15 ± 0.47) × 10⁴⁷ erg s⁻¹. After applying a virial black-hole
mass estimator based on the Mg II line”, we estimate its black-hole
mass to be (1.24 ± 0.19) × 10¹⁰ M⊙. The uncertainty of black-hole mass
does not include the systematic uncertainty of virial black-hole mass
estimation, which could be up to a factor of three’. Assuming that this
quasar is accreting at the Eddington accretion rate and the bolometric
luminosity is close to the Eddington luminosity (L_Edd = 1.3 × 10³⁸
(M/M⊙)), similar to other z > 6 quasars"', leads to a black-hole mass of

Figure 3 | The combined optical/near-infrared spectrum of J0100+2802
and the fitting of the Mg II line. Main panel, the black line shows the LBT
optical spectrum and the red line shows the combined Magellan and Gemini
near-infrared J,H,K-band spectra (from left to right, respectively). The gaps
between J and H and between H and K bands are ignored due to the low sky
transparency there. The magenta line shows the noise spectrum. The main
emission lines Lyα, C IV and Mg II are labelled. The details of the absorption
lines are described in Extended Data Fig. 4. Inset, fits of the Mg II line (with
FWHM of 5,130 ± 150 km s⁻¹) and surrounding Fe II emissions. The green,
cyan and blue solid lines show the power law (PL), Fe II and Mg II components.
The black dashed line shows the sum of these components in comparison with
the observed spectrum, denoted by the red line. The black-hole mass is
estimated to be (1.24 ± 0.19) × 10¹⁰ M⊙.

1.3 × 10¹⁰ M⊙ for J0100+2802. Therefore, our observations strongly indicate
that J0100+2802 harbours a black hole of mass about 1.2 × 10¹⁰ M⊙,
the first such system known at z > 6, though black holes of such a size
have been found in local giant elliptical galaxies** and low-redshift
quasars”’.
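
The Eddington argument used above is a one-line scaling, and a minimal sketch makes the arithmetic explicit. The coefficient 1.3 × 10³⁸ erg s⁻¹ per solar mass is the one quoted in the text; the function names and the example input are hypothetical and chosen only for illustration.

```python
L_EDD_PER_MSUN = 1.3e38  # erg/s per solar mass, coefficient quoted in the text

def eddington_luminosity(mass_msun):
    """Eddington luminosity (erg/s) of a black hole of the given mass in solar masses."""
    return L_EDD_PER_MSUN * mass_msun

def eddington_mass(l_bol):
    """Black-hole mass (solar masses) whose Eddington luminosity equals l_bol (erg/s)."""
    return l_bol / L_EDD_PER_MSUN

def eddington_ratio(l_bol, mass_msun):
    """Bolometric luminosity as a fraction of the Eddington luminosity."""
    return l_bol / eddington_luminosity(mass_msun)

# Illustrative round number, not a measured value
example_mass = eddington_mass(1.3e48)
```

Under this scaling a black hole of 1.2 × 10¹⁰ M⊙ has an Eddington luminosity of about 1.56 × 10⁴⁸ erg s⁻¹, so a source radiating near that level has an Eddington ratio close to one.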

Although gravitational lensing is a possible explanation for the high 
luminosity of J0100+2802, we do not expect a large lensing magnifica- 
tion. An LBT K-band image with seeing of 0.4” shows a morphology fully 
consistent with a single point source (Extended Data Fig. 2); and the 
large size of the quasar proximity zone further supports a high ultra- 
violet luminosity consistent with the expected photoionization scaling”. 
However, absorption features at different redshifts have been identified
from its near-infrared spectroscopy (Extended Data Fig. 4), implying 
the existence of abundant intervening materials along the line of sight. 

J0100+2802 is the only known quasar with a bolometric luminosity
higher than 10⁴⁸ erg s⁻¹ and a black-hole mass larger than 5 × 10⁹ M⊙
at z ≥ 6. It is also close to being the most luminous quasar with the
most massive black hole at any redshift (Fig. 4). The discovery of this
single ultraluminous quasar within the entire SDSS footprint (~13,000
degrees²) is broadly consistent with the extrapolation of the SDSS z ~ 6
quasar luminosity function’®. The number density of such objects would
set strong constraints on the early growth of supermassive black holes
and the evolution of the high-redshift quasar black-hole mass function*"’.
In addition to ULAS J1120+0641 with a 2 × 10⁹ M⊙ black hole®” at
z = 7.085, and a recently discovered z = 6.889 quasar with a black hole
of 2.1 × 10⁹ M⊙ (ref. 13), J0100+2802 with a 1.2 × 10¹⁰ M⊙ black hole
at z = 6.30 presents the next most significant challenge to the Eddington-limited
growth of black holes in the early Universe'!”*. Its existence also
strengthens the claim that supermassive black holes in the early Universe
probably grew much more quickly than their host galaxies, as
argued from a molecular gas study of z ~ 6 quasars*’. Therefore, as the
most luminous quasar known to date at z > 6, J0100+2802 will be a

Figure 4 | Distribution of quasar bolometric luminosities, L_bol, and black-hole
masses, M_BH, estimated from the Mg II lines. The red circle at top right
represents J0100+2802. The small blue squares denote SDSS high-redshift
quasars*'®”’, and the large blue square represents J1148+5251. The green
triangles denote CFHQS high-redshift quasars''’*. The purple star denotes
ULAS J1120+0641 at z = 7.085 (ref. 6). Black contours (which indicate 1σ to
5σ significance from inner to outer) and grey dots denote SDSS low-redshift
quasars” (with broad absorption line quasars excluded). Error bars represent
the 1σ standard deviation, and the mean error bar for low-redshift quasars
is presented in the bottom-right corner. The dashed lines denote the luminosity
in different fractions of the Eddington luminosity, L_Edd. Note that the black-hole
mass and bolometric luminosity are calculated using the same method
and the same cosmology model as in the present Letter, and the systematic
uncertainties (not included in the error bars) of virial black-hole masses could
be up to a factor of three”.

unique resource for the future study of mass assembly and galaxy formation
around the most massive black holes at the end of the epoch of
cosmic reionization”®.


Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper. 
METHODS


The optical spectroscopy on J0100+2802 was first carried out on 29 December
2013 with the Yunnan Fainter Object Spectrograph and Camera (YFOSC) of the
Lijiang 2.4-m telescope in China. We used a very low resolution grism (G12, at a
dispersion of 18 Å per pixel) and took 3,000 s exposure on this target. The spectrum
clearly shows a sharp break at about 8,800 Å and no significant emissions blueward,
consistent with a quasar spectrum at a redshift beyond 6.2. To confirm this discovery,
two subsequent optical spectroscopic observations were obtained on 9 and 24
January 2014 with the 6.5-m Multiple Mirror Telescope (MMT) and the twin 8.4-m
mirror Large Binocular Telescope (LBT) in the USA, respectively. The low to medium
resolution spectra, obtained with 1,200 s exposure using the MMT Red Channel (at a
dispersion of 3.6 Å per pixel) and 2,400 s exposure with the LBT Multi-Object Double
CCD Spectrographs/Imagers (MODS)" (at a dispersion of 1.8 Å per pixel) respectively,
explicitly confirm that SDSS J0100+2802 is a quasar at redshift 6.30 ± 0.01
(obtained by the Lyα line).
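
The step from a spectral break near 8,800 Å to a redshift beyond 6.2 follows from the Lyα rest wavelength of 1,216 Å quoted in the Figure 2 caption. The sketch below spells out that arithmetic; the function names are hypothetical, and treating one 18 Å pixel as the wavelength uncertainty is an assumption made only to show why the low-resolution spectrum gives a coarse redshift.

```python
LYA_REST = 1216.0  # Angstrom, Lyα rest wavelength from the Fig. 2 caption

def redshift_from_observed(observed, rest=LYA_REST):
    """Redshift implied by a feature seen at `observed` with rest wavelength `rest`."""
    return observed / rest - 1.0

def observed_wavelength(z, rest=LYA_REST):
    """Observed wavelength of a rest-frame feature at redshift z."""
    return rest * (1.0 + z)

# Break seen at about 8,800 Angstrom in the discovery spectrum
z_break = redshift_from_observed(8800.0)

# Redshift step for one 18 Angstrom pixel of the G12 grism (assumed error scale)
dz_per_pixel = 18.0 / LYA_REST

# Where Lyα falls for the confirmed redshift of 6.30
lya_at_630 = observed_wavelength(6.30)
```

The break wavelength gives z of roughly 6.24, and a single pixel corresponds to a redshift step of about 0.015, which is why the higher-dispersion MMT and LBT spectra were needed to reach the quoted precision of 0.01.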

The near-infrared K-band spectroscopy on J0100+2802 was carried out with
LBT/LUCI-1 on 2 January 2014. Owing to the short exposure time (15 min), the
spectrum is of modest signal-to-noise ratio (S/N). Although the Mg II line was clearly
detected, the noisy LBT spectrum did not allow us to accurately measure the line
width. To improve the quality of the near-infrared spectrum, we obtained J,H,K-band
spectroscopy with Gemini/GNIRS and Magellan/FIRE on 6 August and
7 October 2014, respectively. The exposure time was 3,600 s for GNIRS and 3,635 s
for FIRE. The FIRE spectrum has higher S/N (about 30 in K band) and higher spectral
resolution (R = λ/Δλ ~ 6,000) than the GNIRS spectrum (with S/N of about 10 in
K band and R = 1,800). In order to achieve the best spectral quality, we combined
the FIRE and the GNIRS spectra, and scaled the combined spectrum according to
its 2MASS J,H,Ks-band magnitudes. The Mg II line shown in the K-band spectrum
gives the same redshift as that given by the Lyα line in the optical spectrum. The
high-quality J,H,K-band spectra also clearly display abundant absorption features,
which have been identified as being from intervening or associated systems with
redshifts from 2.33 to 6.14 (Extended Data Fig. 4).

After redshift and Galactic extinction corrections, the rest-frame H- and K-band
spectrum is decomposed into a pseudo-continuum and the Mg II emission line. The
pseudo-continuum consists of a power-law continuum and Fe II emissions, and
is fitted within the rest-frame wavelength range between 2,000 Å and 3,200 Å by
excluding the boundary region between H and K bands where the sky transparency
is lower. An Fe II template*””? is adopted for the fitting of Fe II emissions. The Mg II
emission line is fitted with two broad Gaussian components. The four Mg II absorption
lines near the redder part of the Mg II emission line are also fitted as four Gaussian
lines in order to remove their effects on the fittings of Mg II and Fe II emission lines.
The overall FWHM of the Mg II emission line is ~5,130 km s⁻¹ with an uncertainty
of 150 km s⁻¹. The continuum has a slope of −1.43 and the continuum luminosity
at the rest-frame wavelength of 3,000 Å (L₃₀₀₀) is (3.15 ± 0.47) × 10⁴⁷ erg s⁻¹. The
Fe II to Mg II line ratio is 2.56 ± 0.18, which is consistent with the mean value of
other z > 6 quasars'*”’. After applying a virial black-hole mass estimator based on
the Mg II line*®, we estimate its black-hole mass to be (1.24 ± 0.19) × 10¹⁰ M⊙.
Although the systematic uncertainty of virial black-hole mass estimation can be up
to a factor of three’’, our result still strongly indicates that J0100+2802 hosts a central
black hole with mass close to 1.2 × 10¹⁰ M⊙. This is also consistent with the
black-hole mass obtained by assuming that J0100+2802 radiates at the Eddington luminosity,
which leads to a mass of 1.3 × 10¹⁰ M⊙. Considering the contribution of Balmer
continuum, as done for other z > 6 quasars'”!’, leads to a decrease of L₃₀₀₀ to
(2.90 ± 0.44) × 10⁴⁷ erg s⁻¹, an increase of FWHM of Mg II to 5,300 ± 200 km s⁻¹
and yields a black-hole mass of (1.26 ± 0.21) × 10¹⁰ M⊙. Therefore, the effect of
considering Balmer continuum is insignificant for the black-hole mass measurement
of J0100+2802. In addition, if we adopt a different virial black-hole mass
scaling relation”, the black-hole mass changes to (1.07 ± 0.14) × 10¹⁰ M⊙, which
is still consistent with the result we obtained above.
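
The section does not write out the virial scaling relations it applies, so the sketch below is an assumption rather than the authors' stated method. It uses two widely used Mg II calibrations of the form M ∝ L₃₀₀₀^a FWHM², with coefficients recalled from the literature (a McLure and Dunlop-type relation and a Vestergaard and Osmer-type relation); the function names are hypothetical. With the FWHM and L₃₀₀₀ values quoted above, these forms return masses close to the three figures in the text, which is the only sense in which they are checked here.

```python
def mbh_mclure_dunlop_type(l3000_erg_s, fwhm_km_s):
    """Assumed calibration: M = 3.2 (L3000 / 1e37 W)^0.62 (FWHM / km s^-1)^2, in solar masses."""
    l_watt = l3000_erg_s * 1e-7  # convert erg/s to W
    return 3.2 * (l_watt / 1e37) ** 0.62 * fwhm_km_s ** 2

def mbh_vestergaard_osmer_type(l3000_erg_s, fwhm_km_s):
    """Assumed calibration: M = 10^6.86 (FWHM / 1000 km s^-1)^2 (L3000 / 1e44 erg s^-1)^0.5."""
    return 10 ** 6.86 * (fwhm_km_s / 1000.0) ** 2 * (l3000_erg_s / 1e44) ** 0.5

# Values quoted in the Methods text
m_fiducial = mbh_mclure_dunlop_type(3.15e47, 5130.0)      # no Balmer continuum
m_balmer = mbh_mclure_dunlop_type(2.90e47, 5300.0)        # with Balmer continuum
m_alternative = mbh_vestergaard_osmer_type(3.15e47, 5130.0)
```

Because the mass scales as FWHM², the 3% increase in line width when the Balmer continuum is included nearly cancels the 8% drop in L₃₀₀₀, which is consistent with the statement that the Balmer continuum has little effect on the mass.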

Sample size. No statistical methods were used to predetermine sample size. 
Extended Data Figure 1 | Images of J0100+2802 in SDSS, 2MASS and
WISE bands. J0100+2802 is undetected in SDSS u,g,r bands (top row) but is
relatively bright in other bands (lower three rows). It is consistent with a point
source in the bands with high signal-to-noise detections. The size is 1′ × 1′ for
all images. The green circle represents an angular size of 10″ in each image.

Extended Data Figure 2 | The LBT K-band image of J0100+2802. The size
is 10″ × 10″. The horizontal and vertical axes denote the offsets in right
ascension (ΔRA) and in declination (ΔDec.). The image, with seeing of 0.4″,
shows a morphology fully consistent with a point source.

Extended Data Figure 3 | The rest-frame spectral energy distributions of
J0100+2802, J1148+5251 and ULAS J1120+0641. The redshifts of these
three quasars are 6.30, 6.42 and 7.085, respectively. The luminosity of
J0100+2802 in the ultraviolet/optical bands is about four times higher than that
of J1148+5251, and seven times higher than that of ULAS J1120+0641. The
photometric data are from literature for J1148+5251 and J1120+0641. The
error bars show the 1σ standard deviation.

Extended Data Figure 4 | The major absorption features identified from
optical and near-infrared spectroscopy of J0100+2802. Most of them are
from Mg II, C IV and Fe II. The labels from A to H correspond to the redshifts of
absorption materials at 6.14, 6.11, 5.32, 5.11, 4.52, 4.22, 3.34 and 2.33,
respectively. Studies of intervening and associated absorption systems will be
discussed elsewhere.

Quantum teleportation of multiple degrees of
freedom of a single photon

Xi-Lin Wang, Xin-Dong Cai, Zu-En Su, Ming-Cheng Chen, Dian Wu, Li Li, Nai-Le Liu, Chao-Yang Lu
& Jian-Wei Pan


Quantum teleportation’ provides a ‘disembodied’ way to transfer quan- 
tum states from one object to another at a distant location, assisted 
by previously shared entangled states and a classical communica- 
tion channel. As well as being of fundamental interest, teleportation 
has been recognized as an important element in long-distance quantum 
communication’, distributed quantum networks’ and measurement- 
based quantum computation**. There have been numerous demonstra- 
tions of teleportation in different physical systems such as photons**, 
atoms’, ions'®”, electrons’? and superconducting circuits’*. All the 
previous experiments were limited to the teleportation of one degree 
of freedom only. However, a single quantum particle can naturally 
possess various degrees of freedom—internal and external—and with 
coherent coupling among them. A fundamental open challenge is to 
teleport multiple degrees of freedom simultaneously, which is neces- 
sary to describe a quantum particle fully and, therefore, to teleport 
it intact. Here we demonstrate quantum teleportation of the com- 
posite quantum states of a single photon encoded in both spin and 
orbital angular momentum. We use photon pairs entangled in both 
degrees of freedom (that is, hyper-entangled) as the quantum chan- 
nel for teleportation, and develop a method to project and discrim- 
inate hyper-entangled Bell states by exploiting probabilistic quantum 
non-demolition measurement, which can be extended to more de- 
grees of freedom. We verify the teleportation for both spin-orbit 
product states and hybrid entangled states, and achieve a teleporta- 
tion fidelity ranging from 0.57 to 0.68, above the classical limit. Our 
work is a step towards the teleportation of more complex quantum 
systems, and demonstrates an increase in our technical control of 
scalable quantum technologies. 

Quantum teleportation is a linear operation applied to quantum states, 
and so teleporting multiple degrees of freedom (DoFs) should be pos- 
sible in theory’. Suppose Alice wishes to teleport to Bob the composite 
quantum state of a single photon (photon 1; Fig. 1a), encoded in both 
the spin angular momentum (SAM) and the orbital angular momentum 
(OAM) as follows: 


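$$|\varphi\rangle_1 = \alpha|0\rangle_1^{s}|0\rangle_1^{o} + \beta|0\rangle_1^{s}|1\rangle_1^{o} + \gamma|1\rangle_1^{s}|0\rangle_1^{o} + \delta|1\rangle_1^{s}|1\rangle_1^{o}$$

Here $|0\rangle^{s}$ and $|1\rangle^{s}$ denote horizontal and vertical polarizations of the
SAM; $|0\rangle^{o}$ and $|1\rangle^{o}$ refer to right-handed and left-handed OAMs of
$+\hbar$ and $-\hbar$, respectively; and $\alpha$, $\beta$, $\gamma$ and $\delta$ are complex numbers
satisfying $|\alpha|^2 + |\beta|^2 + |\gamma|^2 + |\delta|^2 = 1$. For Alice to do so, she and Bob
first need to share a hyper-entangled photon pair (photons 2 and 3),
which is simultaneously entangled in both SAM and OAM:

$$|\xi\rangle_{23} = |\phi^{-}\rangle_{23} \otimes |\omega^{+}\rangle_{23} = \tfrac{1}{2}\left(|0\rangle_2^{s}|0\rangle_3^{s} - |1\rangle_2^{s}|1\rangle_3^{s}\right)\left(|0\rangle_2^{o}|0\rangle_3^{o} + |1\rangle_2^{o}|1\rangle_3^{o}\right)$$

Here we use $|\phi^{\pm}\rangle = (|0\rangle^{s}|0\rangle^{s} \pm |1\rangle^{s}|1\rangle^{s})/\sqrt{2}$ and
$|\psi^{\pm}\rangle = (|0\rangle^{s}|1\rangle^{s} \pm |1\rangle^{s}|0\rangle^{s})/\sqrt{2}$ to denote the four Bell states encoded in SAM, and
$|\omega^{\pm}\rangle = (|0\rangle^{o}|0\rangle^{o} \pm |1\rangle^{o}|1\rangle^{o})/\sqrt{2}$ and
$|\chi^{\pm}\rangle = (|0\rangle^{o}|1\rangle^{o} \pm |1\rangle^{o}|0\rangle^{o})/\sqrt{2}$
to denote the four Bell states in OAM. Their tensor products result in a
group of 16 hyper-entangled Bell states.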
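
The statement that the SAM and OAM Bell states combine into 16 hyper-entangled Bell states can be checked numerically. The sketch below builds the four Bell states for each degree of freedom as plain vectors, forms all 16 tensor products and verifies that they are orthonormal, so they span the full 16-dimensional two-photon space. The labels and function names are hypothetical, and the same four vectors are reused for SAM and OAM because only the algebraic structure matters here.

```python
import itertools
import numpy as np

KET0 = np.array([1.0, 0.0])
KET1 = np.array([0.0, 1.0])
S = 1 / np.sqrt(2)

def bell_basis():
    """Four two-qubit Bell states as vectors in the |00>, |01>, |10>, |11> ordering."""
    k = np.kron
    return {
        "00+11": S * (k(KET0, KET0) + k(KET1, KET1)),
        "00-11": S * (k(KET0, KET0) - k(KET1, KET1)),
        "01+10": S * (k(KET0, KET1) + k(KET1, KET0)),
        "01-10": S * (k(KET0, KET1) - k(KET1, KET0)),
    }

def hyper_bell_basis():
    """All products of one SAM Bell state with one OAM Bell state (16 vectors)."""
    sam, oam = bell_basis(), bell_basis()
    return {(a, b): np.kron(sam[a], oam[b]) for a, b in itertools.product(sam, oam)}

def gram_matrix(states):
    """Matrix of pairwise overlaps between the given states."""
    vecs = np.array(list(states.values()))
    return vecs @ vecs.conj().T

states = hyper_bell_basis()
gram = gram_matrix(states)
```

An identity Gram matrix means the 16 states are mutually orthogonal and normalised, which is what makes a complete hyper-entangled Bell-state measurement well defined in principle.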

A crucial step in the teleportation is to perform a two-particle joint 
measurement of photons 1 and 2, projecting them onto the basis of the 

Figure 1 | Scheme for quantum teleportation of the spin-orbit composite
states of a single photon. a, Alice wishes to teleport to Bob the quantum state
of a single photon 1 encoded in both its SAM and OAM. To do so, Alice
and Bob need to share a hyper-entangled photon pair 2-3. Alice then carries out
an h-BSM assisted by a QND measurement (see main text for details) and sends
the results as four-bit classical information to Bob. On receiving Alice’s h-BSM
result, Bob can apply appropriate Pauli operations (denoted $U^{o}$ and $U^{s}$ for
the OAM and SAM DoFs, respectively) on photon 3 to convert it into the
original state of photon 1. The active feed-forward is essential for a full,
deterministic teleportation. In our present proof-of-principle experiment, we
did not apply feed-forward but used post-selection to verify the success of
teleportation. BS, beam splitter. b, Teleportation-based probabilistic QND
measurement with an ancillary entangled photon pair. An incoming photon
can cause a coincidence detection after the beam splitter, which heralds its
presence and meanwhile fully teleports its arbitrary unknown quantum state
($|\varphi\rangle$) to a free-flying photon. In the case of no incoming photon, the
coincidence event after the beam splitter cannot happen, thus indicating the
photon’s absence.

16 orthogonal and complete hyper-entangled Bell states, and discriminating
one of them; for example

$$|\xi\rangle_{12} = |\phi^{-}\rangle_{12}|\omega^{+}\rangle_{12} = \tfrac{1}{2}\left(|0\rangle_1^{s}|0\rangle_2^{s} - |1\rangle_1^{s}|1\rangle_2^{s}\right)\left(|0\rangle_1^{o}|0\rangle_2^{o} + |1\rangle_1^{o}|1\rangle_2^{o}\right)$$

This process is referred to as hyper-entangled Bell state measurement
(h-BSM). After the h-BSM that projects photons 1 and 2 onto the state
$|\xi\rangle_{12}$, photon 3 will be projected onto the initial state of photon 1:

$$|\varphi\rangle_3 = \alpha|0\rangle_3^{s}|0\rangle_3^{o} + \beta|0\rangle_3^{s}|1\rangle_3^{o} + \gamma|1\rangle_3^{s}|0\rangle_3^{o} + \delta|1\rangle_3^{s}|1\rangle_3^{o}$$

With equal probabilities of 1/16, photons 1 and 2 can also be projected
onto one of the other 15 hyper-entangled Bell states (Methods). The
h-BSM results can be broadcast as four-bit classical information, which
will allow Bob to apply appropriate Pauli operations to perfectly reconstruct
the initial composite state of photon 1.
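
The projection argument above can be reproduced with a small state-vector calculation. The sketch below is an idealised model and not a simulation of the optical set-up: each photon carries a SAM qubit and an OAM qubit, photons 2 and 3 start in the hyper-entangled channel state, and photons 1 and 2 are projected onto a chosen hyper-entangled Bell state. The function names, the tensor index ordering and the random input state are assumptions made for this illustration.

```python
import numpy as np

S = 1 / np.sqrt(2)
# Bell states as 2x2 coefficient matrices m[i, j] multiplying |i>|j>.
# The same matrices serve for SAM (phi, psi) and OAM (omega, chi) Bell states.
BELL = {
    "phi+": S * np.array([[1, 0], [0, 1]], dtype=complex),
    "phi-": S * np.array([[1, 0], [0, -1]], dtype=complex),
    "psi+": S * np.array([[0, 1], [1, 0]], dtype=complex),
    "psi-": S * np.array([[0, 1], [-1, 0]], dtype=complex),
}

def hyper_bell(sam, oam):
    """Tensor T[sA, oA, sB, oB] for a SAM Bell state times an OAM Bell state."""
    return np.einsum("ac,bd->abcd", BELL[sam], BELL[oam])

# Channel shared by photons 2 and 3: |phi->_23 in SAM times |omega+>_23 in OAM
CHANNEL = hyper_bell("phi-", "phi+")

def teleport(phi, sam="phi-", oam="phi+"):
    """Project photons 1 and 2 onto one hyper-entangled Bell state.

    phi[s, o] holds the amplitudes of photon 1. Returns the outcome
    probability and the normalised state left on photon 3.
    """
    outcome = hyper_bell(sam, oam)
    out = np.einsum("abcd,ab,cdef->ef", outcome.conj(), phi, CHANNEL)
    prob = float(np.sum(np.abs(out) ** 2))
    return prob, out / np.sqrt(prob)

def random_state(seed=0):
    """Random normalised SAM-OAM state of a single photon."""
    rng = np.random.default_rng(seed)
    phi = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    return phi / np.linalg.norm(phi)

phi_in = random_state()
prob, phi_out = teleport(phi_in)
all_probs = [teleport(phi_in, a, b)[0] for a in BELL for b in BELL]
```

For the outcome matching the channel state the output equals the input with no correction, and every one of the 16 outcomes occurs with probability 1/16; the other 15 outputs differ from the input by the Pauli operations that Bob would apply.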

Experimental realization of the above teleportation protocol poses 
significant challenges to the coherent control of multiple particles and 
multiple DoFs simultaneously. The most difficult task is to implement 
the h-BSM, because it would normally require coherently controlled 
gates between independent quantum bits (qubits) of different DoFs. 
Moreover, with multiple DoFs, it is necessary to measure one DoF 
without disturbing any other. With linear operations only, previous 
theoretical work"* has suggested that it is impossible to discriminate 
the hyper-entangled states unambiguously. This challenge has been 
overcome in our work. 

Figure 1 illustrates our linear optical scheme for teleporting the spin-orbit
composite state. The h-BSM is implemented in a step-by-step
manner, as a combination of two separate BSMs. First, photons 1 and 2
are sent through a polarizing beam splitter (PBS), an optical device that
transmits horizontal polarizations ($|0\rangle^{s}$) and reflects vertical polarization
($|1\rangle^{s}$). After the PBS, we post-select the event that there is one and
only one photon in each output. Such an event can occur only if the two
input photons have the same SAM (both are transmitted ($|0\rangle_1^{s}|0\rangle_2^{s}$) or
reflected ($|1\rangle_1^{s}|1\rangle_2^{s}$)), which projects the SAM part of the wavefunction
into the two-dimensional subspace spanned by $|\phi^{\pm}\rangle = (|0\rangle^{s}|0\rangle^{s} \pm |1\rangle^{s}|1\rangle^{s})/\sqrt{2}$.
At both outputs of the PBS, we add two polarizers, projecting
the two photons into the diagonal basis $(|0\rangle^{s} + |1\rangle^{s})/\sqrt{2}$. It should be
noted that the PBS is not OAM-preserving”’, because the reflection at
the PBS flips the sign of the OAM qubit; that is, $|0\rangle^{s}|0\rangle^{o} \to |0\rangle^{s}|0\rangle^{o}$,
$|0\rangle^{s}|1\rangle^{o} \to |0\rangle^{s}|1\rangle^{o}$, $|1\rangle^{s}|0\rangle^{o} \to i|1\rangle^{s}|1\rangle^{o}$ and $|1\rangle^{s}|1\rangle^{o} \to i|1\rangle^{s}|0\rangle^{o}$.
Thus, both the SAM and the OAM must be taken into account as in
a molecular-like coupled state. A detailed mathematical treatment is
presented in Methods, showing that the PBS and two polarizers select
the four following states out of the total 16 hyper-entangled Bell states:
$|\phi^{+}\rangle_{12}|\omega^{-}\rangle_{12}$, $|\phi^{-}\rangle_{12}|\omega^{+}\rangle_{12}$, $|\phi^{-}\rangle_{12}|\chi^{+}\rangle_{12}$ and $|\phi^{-}\rangle_{12}|\chi^{-}\rangle_{12}$.

Second, having measured and filtered out the SAM, we perform 
BSM on the remaining OAM qubit. The two single photons that emerge 
from the PBS are superposed on a beam splitter (Fig. 1a). Only the 
asymmetric Bell state will lead to a coincidence detection where there is 
one and only one photon in each output’®, whereas for the three other 
symmetric Bell states, the two input photons will coalesce to a single 
output mode. Note again that reflection at the beam splitter inverts the 
OAM sign. Therefore, the state $|\omega^{-}\rangle_{12}$ can be distinguished by a coincidence
detection in separate outputs, and $|\omega^{+}\rangle_{12}$ can be discriminated
by measuring two orthogonal OAMs in either output. In total, these
two steps would allow an unambiguous discrimination of the two hyper-entangled
Bell states $|\phi^{+}\rangle_{12}|\omega^{-}\rangle_{12}$ and $|\phi^{-}\rangle_{12}|\omega^{+}\rangle_{12}$.

However, these interferometric processes cannot be simply cascaded. 
They depend on the assumption that two input photons are from dif- 
ferent paths. When directly connected, the 50% chance of both photons 
coalescing into a single output spatial mode after the first ‘interferometer’ 
would remain undetected (no ‘coincidence’ event occurs), and would 
further induce erroneous detection of a coincidence event after the second 
interferometer, thus failing both BSMs. This is a ubiquitous and important
problem in linear optical quantum information processing””’.

Our remedy is to exploit quantum non-demolition (QND) measure- 
ment, whereby a single photon is observed without destroying it and 
keeping its quantum information intact. Interestingly, quantum tele- 
portation itself can be used for probabilistic QND detection”’*. As shown 
in Fig. 1b, another pair of photons entangled in OAM is used as ancillary. 
The procedure is a standard teleportation. If there is an incoming photon, 
a two-photon coincidence detection behind a beam splitter can occur 
with 50% efficiency, triggering a successful BSM, which heralds the pre- 
sence of the incoming photon and teleports the full quantum state of 
the incoming photon to a freely propagating photon. If there is no 
incoming photon, two-photon coincidence behind the beam splitter 
cannot occur, in which case we will know and will ignore the outgoing 
photon. Thus, using the QND measurement, the BSM interferometers 
can be concatenated. We note that using a QND in one of the arms of
the beam splitter is sufficient owing to the conserved total number of 
eventually registered photons. Our protocol can identify two hyper- 
entangled Bell states with an overall efficiency of 1/32 (Methods). The 
QND method can also be used to boost the efficiency of photonic quan- 
tum logic gates in the Knill-Laflamme-Milburn scheme for scalable 
optical quantum computing’. 

Figure 2 shows the experimental set-up for quantum teleportation of
the spin-orbit composite state of a single photon. Passing a femtosecond-pulsed
laser through three type-I β-barium borate crystals generates
three photon pairs’’’, engineered in different forms (Methods). The
first photon pair (1-t) is used to prepare a heralded single photon (1) to
be teleported, triggered by the detection of its sister photon (t). The
second pair (2-3) is created in the hyper-entangled state $|\xi\rangle_{23}$. The third
pair (4-5) is prepared in the ancillary OAM-entangled state $|\omega^{+}\rangle_{45}$ for
the teleportation-based QND measurement.

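We prepare five different initial states to be teleported: $|\varphi\rangle_A = |0\rangle^{s}|0\rangle^{o}$,
$|\varphi\rangle_B = |1\rangle^{s}|1\rangle^{o}$, $|\varphi\rangle_C = (|0\rangle^{s} + |1\rangle^{s})(|0\rangle^{o} + |1\rangle^{o})/2$,
$|\varphi\rangle_D = (|0\rangle^{s} + i|1\rangle^{s})(|0\rangle^{o} + i|1\rangle^{o})/2$ and $|\varphi\rangle_E = (|0\rangle^{s}|0\rangle^{o} + |1\rangle^{s}|1\rangle^{o})/\sqrt{2}$.
These states can be grouped into three categories: $|\varphi\rangle_A$ and $|\varphi\rangle_B$ are
product states of the two DoFs in the computational basis; $|\varphi\rangle_C$ and
$|\varphi\rangle_D$ are product states of the two DoFs in the superposition basis; and
$|\varphi\rangle_E$ is a spin-orbit hybrid entangled state. The four product states are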
prepared by independent single-qubit rotations, using wave plates for 
SAM and spiral phase plates (SPPs) or binary phase plates (BPPs) for 
OAM. The entangled state is generated by counter-propagating a SPP 
inside a Sagnac interferometer (Methods). 

The implementation of the h-BSM and the QND measurement 
requires Hong-Ou-Mandel-type interference” between indistinguish- 
able single photons with good time, spatial and spectral overlap. The 
two photons are synchronized to arrive at the PBS and beam splitters 
within 10 fs of each other, a delay that is much smaller than the coher- 
ence time of the down-converted photons (~448 fs), which is stretched 
by narrowband spectral filtering (~3 nm). A step-by-step verification 
of the two-photon interference as a function of temporal delay is presented
in Methods and Extended Data Fig. 1. We observe a visibility of
0.75 ± 0.03 for the interference of the two SAM-encoded photons at
the PBS, and visibilities of 0.73 ± 0.03 and 0.69 ± 0.03 for the interferences
of two OAM-encoded photons at beam splitter 1 and beam
splitter 2, respectively.

The final verification of the teleportation results relies on the coin- 
cidence detection counts of the six photons encoded in both DoFs, 
which would suffer from a low rate. It should be noted that all the previous 
experiments’*”°”!?*° with OAM states have never gone beyond two 
single photons. We overcome this technical challenge by preparing high- 
brightness, hyper-entangled photons and designing dual-channel and 
high-efficiency OAM measurement devices (Methods). 

To evaluate the performance of the teleportation operation, we measure
the fidelity of the teleported state, $F = \mathrm{Tr}(\hat{\rho}\,|\varphi\rangle\langle\varphi|)$, which is defined
as the overlap of the ideal teleported state ($|\varphi\rangle$) and the measured
density matrix ($\hat{\rho}$). Conditioned on the detection of the trigger photon
t and the four-photon coincidence after the h-BSM, we register the
photon counts of teleported photon 3 and analyse its composite state.
The fidelity measurements for the product states are straightforward
because we can measure the SAM and OAM qubits separately. We measure
the final state in the $(|0\rangle^{s}, |1\rangle^{s})$ and $(|0\rangle^{o}, |1\rangle^{o})$ bases for the teleportation
of $|\varphi\rangle_A$ and $|\varphi\rangle_B$, in the $(|0\rangle^{s} \pm |1\rangle^{s})/\sqrt{2}$ and $(|0\rangle^{o} \pm |1\rangle^{o})/\sqrt{2}$
bases for the teleportation of $|\varphi\rangle_C$, and in the $(|0\rangle^{s} \pm i|1\rangle^{s})/\sqrt{2}$ and
$(|0\rangle^{o} \pm i|1\rangle^{o})/\sqrt{2}$ bases for the teleportation of $|\varphi\rangle_D$. The data in Fig. 3
yield teleportation fidelities of 0.68 ± 0.04, 0.66 ± 0.04, 0.62 ± 0.04
and 0.63 ± 0.04 for $|\varphi\rangle_A$, $|\varphi\rangle_B$, $|\varphi\rangle_C$ and $|\varphi\rangle_D$, respectively.

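The fidelity of the entangled state $|\varphi\rangle_E$, where the SAM and OAM
qubits are not separable, can be decomposed as
$F_E = \mathrm{Tr}[\hat{\rho}(I + \hat{\sigma}_x^{s}\hat{\sigma}_x^{o} - \hat{\sigma}_y^{s}\hat{\sigma}_y^{o} + \hat{\sigma}_z^{s}\hat{\sigma}_z^{o})]/4$,
where $\hat{\sigma}_x$, $\hat{\sigma}_y$ and $\hat{\sigma}_z$ are the Pauli operators. The
expectation values of the joint observables $\hat{\sigma}_x^{s}\hat{\sigma}_x^{o}$, $\hat{\sigma}_y^{s}\hat{\sigma}_y^{o}$ and $\hat{\sigma}_z^{s}\hat{\sigma}_z^{o}$ can
be obtained by local measurements in the corresponding basis for the
two DoFs, $(|0\rangle \pm |1\rangle)/\sqrt{2}$, $(|0\rangle \pm i|1\rangle)/\sqrt{2}$ and $(|0\rangle, |1\rangle)$, respectively.
Our experimental results on the three different bases are presented in
Fig. 4a-c, from which we determine a teleportation fidelity of 0.57 ± 0.02
for the entangled state.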
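
The decomposition of the entangled-state fidelity into three joint Pauli correlators can be verified directly. The sketch below checks that the projector onto $(|0\rangle|0\rangle + |1\rangle|1\rangle)/\sqrt{2}$ equals $(I + \sigma_x\sigma_x - \sigma_y\sigma_y + \sigma_z\sigma_z)/4$ and then evaluates the fidelity from correlators alone. The white-noise model and all function names are assumptions used for illustration; they do not describe the experimental noise.

```python
import numpy as np

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

# Target state (|0>|0> + |1>|1>)/sqrt(2); first qubit SAM, second qubit OAM
KET = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
TARGET = np.outer(KET, KET.conj())

# Projector written with joint Pauli operators, as in the text
PAULI_FORM = (np.kron(I2, I2) + np.kron(X, X) - np.kron(Y, Y) + np.kron(Z, Z)) / 4

def correlators(rho):
    """Expectation values of XX, YY and ZZ for a two-qubit density matrix."""
    return tuple(float(np.real(np.trace(rho @ np.kron(p, p)))) for p in (X, Y, Z))

def fidelity_from_correlators(xx, yy, zz):
    """Fidelity with the target state from the three joint correlators."""
    return (1 + xx - yy + zz) / 4

def noisy_state(p):
    """Hypothetical noise model: target state mixed with white noise, weight p on the target."""
    return p * TARGET + (1 - p) * np.eye(4) / 4

f_half = fidelity_from_correlators(*correlators(noisy_state(0.5)))
```

Only three measurement settings are needed because the target projector contains no other Pauli terms; a fully mixed state gives a fidelity of 0.25 under this formula.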

We note that all reported data are without background subtraction. 
The main sources of error include double pair emission, imperfection 
in the initial states, entanglement of photons 2-3 and 4-5, two-photon 
interference and OAM measurement. We note that the teleportation 
fidelities of the states in the three categories are affected differently by 
errors from different sources (Methods). Despite the experimental noise, 
the measured fidelities (summarized in Fig. 4d) of the five teleported 
states are all well above 0.40—the classical limit, defined as the optimal 
state-estimation fidelity on a single copy of a two-qubit system”. These 
results prove the successful realization of quantum teleportation of the 
spin-orbit composite state of a single photon. Furthermore, for the 
entangled state, $|\varphi\rangle_E$, we emphasize that the teleportation fidelity exceeds
the threshold of 0.5 required to prove the presence of entanglement”®, 
which demonstrates that the hybrid entanglement of different DoFs 
inside a quantum particle can survive teleportation. 
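
The comparison with the classical limit can be made quantitative using the fidelities and 1 s.d. uncertainties quoted above. The sketch assumes the standard result that the optimal single-copy state-estimation fidelity for a d-dimensional system is 2/(d + 1), which gives the 0.40 limit of the text for d = 4; that general formula is not stated in this section. The function names are hypothetical, and the excess is expressed simply in units of the quoted uncertainty.

```python
def classical_limit(dim):
    """Assumed optimal single-copy state-estimation fidelity, 2 / (dim + 1)."""
    return 2 / (dim + 1)

# Fidelities and 1 s.d. uncertainties quoted in the text for states A to E
FIDELITIES = {
    "A": (0.68, 0.04),
    "B": (0.66, 0.04),
    "C": (0.62, 0.04),
    "D": (0.63, 0.04),
    "E": (0.57, 0.02),
}

def excess_sigma(value, err, threshold):
    """How many standard deviations `value` lies above `threshold`."""
    return (value - threshold) / err

limit = classical_limit(4)  # two qubits span a 4-dimensional space
above_classical = {k: excess_sigma(v, e, limit) for k, (v, e) in FIDELITIES.items()}
entanglement_excess = excess_sigma(*FIDELITIES["E"], 0.5)
```

On these numbers every state lies more than five standard deviations above 0.40, and the entangled state lies about 3.5 standard deviations above the 0.5 entanglement threshold.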

We have reported the quantum teleportation of multiple properties 
of a single quantum particle, demonstrating the ability to coherently 
control and simultaneously teleport a single object with multiple DoFs 
that forms in a hybrid entangled state. It is interesting to note that these 
DoFs can also be in a fully undefined state, such as being part of a hyper-entangled
pair, which would lead to the protocol of hyper-entanglement
swapping~’. Our methods can be generalized to more DoFs (see Methods 
for a universal scheme). The efficiency of teleportation, which is lim- 
ited mainly by the efficiency of h-BSM (1/32 in the present experiment), 
can be enhanced by using more ancillary photons, quantum encoding, 
embedded teleportation tricks, high-efficiency single-photon detectors 
and active feed-forward, in a spirit similar to the Knill-Laflamme- 
Milburn scheme’. We did not implement the feed-forward in the current 

Figure 2 | Experimental set-up for teleporting
multiple properties of a single photon. A pulsed
ultraviolet (UV) laser is focused on three β-barium
borate (BBO) crystals and produces three
photon pairs in spatial modes 1-t, 2-3 and 4-5.
Triggered by its sister photon, t, photon 1 is
initialised in various spin-orbit composite states
to be teleported. The second pair, 2-3, is
hyper-entangled in both SAM and OAM. The third
pair, 4-5, is OAM-entangled. The h-BSMs for
photons 1 and 2 are performed in three steps:
(1) SAM BSM; (2) QND measurement; (3) OAM
BSM. The teleported state is measured separately
in SAM and OAM: a PBS, a half-wave plate
(HWP) and a quarter-wave plate (QWP) are
combined for SAM qubit analysis, and an SPP or
a BPP together with a single-mode fibre are used
for OAM qubit analysis.


experiment, but it could be done using electro-optical modulators for 
both the SAM and the OAM qubits (see Methods for a detailed pro- 
tocol). Although the present work is based on linear optics and single 
photons, the multi-DoF teleportation protocol is by no means limited 
to this system, but can also be applied to other quantum systems such 
as trapped electrons”’, atoms’ and ions’. 

As well as being of fundamental interest, the methods developed in 
this work on the manipulation of quantum states of multiple DoFs will 

Figure 3 | Experimental results for quantum teleportation of spin-orbit
product states $|\varphi\rangle_A$, $|\varphi\rangle_B$, $|\varphi\rangle_C$ and $|\varphi\rangle_D$ of a single photon. a, b, Measurement
results of the final state of the teleported photon 3 in the $(|0\rangle^{s}, |1\rangle^{s})$ and
$(|0\rangle^{o}, |1\rangle^{o})$ bases for the $|\varphi\rangle_A$ (a) and $|\varphi\rangle_B$ (b) teleportation experiment. c, The
results for the $|\varphi\rangle_C$ teleportation, measured in the $(|0\rangle^{s} \pm |1\rangle^{s})/\sqrt{2}$ and
$(|0\rangle^{o} \pm |1\rangle^{o})/\sqrt{2}$ bases. d, The results for the $|\varphi\rangle_D$ teleportation, measured in
the $(|0\rangle^{s} \pm i|1\rangle^{s})/\sqrt{2}$ and $(|0\rangle^{o} \pm i|1\rangle^{o})/\sqrt{2}$ bases. The x axis uses Pauli
notation for both DoFs: $(|0\rangle_z, |1\rangle_z) = (|0\rangle, |1\rangle)$, $(|0\rangle_x, |1\rangle_x) = (|0\rangle \pm |1\rangle)/\sqrt{2}$ and
$(|0\rangle_y, |1\rangle_y) = (|0\rangle \pm i|1\rangle)/\sqrt{2}$. The y axis is six-photon coincidence counts.
Error bars, 1 s.d., calculated from Poissonian counting statistics of the raw
detection events.
Figure 4 | Experimental results for quantum teleportation of spin-orbit entanglement of a single photon. a-c, To determine the state fidelity of the teleported entangled state, $|\varphi\rangle_E = (|0\rangle^s|0\rangle^o + |1\rangle^s|1\rangle^o)/\sqrt{2}$, three measurement bases are required: $(|0\rangle^s, |1\rangle^s)$ and $(|0\rangle^o, |1\rangle^o)$ (a), $(|0\rangle^s \pm |1\rangle^s)/\sqrt{2}$ and $(|0\rangle^o \pm |1\rangle^o)/\sqrt{2}$ (b), and $(|0\rangle^s \pm i|1\rangle^s)/\sqrt{2}$ and $(|0\rangle^o \pm i|1\rangle^o)/\sqrt{2}$ (c). These are used to extract the expectation values of the joint Pauli observables $\sigma_z^s\sigma_z^o$, $\sigma_x^s\sigma_x^o$ and $\sigma_y^s\sigma_y^o$, respectively. Each measurement takes 12 h. The data set in a determines the population of the two desired terms, $|0\rangle^s|0\rangle^o$ and $|1\rangle^s|1\rangle^o$, in the entangled state $|\varphi\rangle_E$. The data sets in b and c, measured in the superposition bases, determine the coherence of the entangled state. We use the same Pauli notation as in Fig. 3. d, A summary of the teleportation fidelities for the teleported states. Error bars, 1 s.d., calculated from Poissonian counting statistics of the raw detection events.


open up new possibilities in quantum technologies. Controlling multiple DoFs makes complete SAM Bell-state analysis and alignment-free quantum communication possible, for example. Carrying multiple DoFs on photons can enhance the information capacity in quantum communication protocols such as quantum super-dense coding. Moreover, combining the entanglement of multiple (M) photons and high (N) dimensions from the multiple DoFs would allow the generation of hyper-entanglement with an expanded Hilbert space that grows in size as $N^M$, which would provide a versatile platform in the near future for demonstrations of complex quantum communication and quantum computing protocols, as well as extreme violations of Bell inequalities.
Online Content Methods, along with any additional Extended Data display items and Source Data, are available in the online version of the paper; references unique to these sections appear only in the online paper.
METHODS

The protocol for teleporting a spin-orbit composite quantum state. The com- 
bined state of photons 1, 2 and 3 can be rewritten in the basis of the 16 orthogonal 
and complete hyper-entangled Bell states as follows: 
where the triples $(\sigma_x^s, \sigma_y^s, \sigma_z^s)$ and $(\sigma_x^o, \sigma_y^o, \sigma_z^o)$ are the Pauli operators for the SAM and, respectively, the OAM qubits. It indicates that, regardless of the unknown state $|\varphi\rangle_1$, the 16 measurement outcomes are equally likely, each with a probability of 1/16. By carrying out the hyper-entangled Bell state measurement (h-BSM) on photons 1 and 2 to unambiguously distinguish one from the group of 16 hyper-entangled Bell states, Alice can project photon 3 onto one of the 16 corresponding states. After Alice tells Bob her h-BSM result as four-bit classical information via a classical communication channel, Bob can convert the state of his photon 3 into the original state by applying appropriate two-qubit local unitary transformations $(I^s, \sigma_x^s, \sigma_y^s, \sigma_z^s) \otimes (I^o, \sigma_x^o, \sigma_y^o, \sigma_z^o)$.
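
As a hedged sketch of the decomposition this paragraph describes, assume that the shared pair 2-3 is in a product of one SAM Bell state and one OAM Bell state. Write $B_j$ ($j = 0, \dots, 3$) for the four Bell states of a single DoF; this labelling is introduced here for illustration and is not the paper's notation. The standard teleportation identity, applied to both DoFs at once, then takes the form:

```latex
|\varphi\rangle_{1}\otimes|B_0^{s}\rangle_{23}\,|B_0^{o}\rangle_{23}
  = \frac{1}{4}\sum_{j=0}^{3}\sum_{k=0}^{3}
    |B_j^{s}\rangle_{12}\,|B_k^{o}\rangle_{12}\otimes
    \left(U_j^{s}\otimes U_k^{o}\right)|\varphi\rangle_{3},
\qquad
U_j \in \{I,\ \sigma_x,\ \sigma_y,\ \sigma_z\}\ \text{(up to a phase)}
```

Each of the 16 orthogonal terms carries amplitude 1/4, which is where the probability of 1/16 per outcome comes from. Which Pauli operator pairs with which Bell state depends on the phase conventions chosen for the Bell states.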

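Two-photon interference of spin-orbit composite state on a PBS. The input and output of photons 1 and 2 encoded in SAM and OAM on a PBS as shown in Fig. 1 can be summarized as follows: $|0\rangle^s_1|0\rangle^o_1 \to |0\rangle^s_{1'}|0\rangle^o_{1'}$, $|0\rangle^s_1|1\rangle^o_1 \to |0\rangle^s_{1'}|1\rangle^o_{1'}$, $|1\rangle^s_1|0\rangle^o_1 \to i|1\rangle^s_{2'}|1\rangle^o_{2'}$ and $|1\rangle^s_1|1\rangle^o_1 \to i|1\rangle^s_{2'}|0\rangle^o_{2'}$, where 1' and 2' denote the output modes, and likewise for photon 2 with the two output modes exchanged. Note that the PBS is not OAM-preserving, because the reflection flips the sign of OAM. Therefore, the output state for each of the input 16 hyper-entangled Bell states can be listed below: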


(6 )al* 1a Ty DvDa +AyAx)|O7 ) 1 (1) 
Ib) plom) a Dvds + Ay A207 Dy (2) 
0 )ale*)i2> GOvDa tarde") (3) 
ule a Dvds + Av At )yy (4) 
16" )ialO*)r2-* Ge (Dd +AyDy)|o*) (5) 
I )2lom) n> qv Ay + AyDy)|07) yy (6) 
ey ale) a Dray + AyDz Ix" vy (7) 


= 1 EA 
0* )plx n> Fg Dv Ay + AvDy)|x dvr (8) 


A 


i 
yt dp ot) > 3 Prty Dy ly —AypryAyly 


+ Dy ry Dy by — Az rz Ax hy) 


i 
a) at) > (Dy ry Dy ly —Ayr, Aylhy 
12 12 2 1 (10) 
— Dy ry Dy ly + Az ty Ay Ly) 


i 
\y*) |o~) = =(Dy ry Ayly —Aypry Dy ly 
12 12 2 (11) 
+ Dy Ty Az by = Ay 1! Dy by ) 


i 
IW) 107) 9 > 5 (Dy tv Av ly —Ayty Dy ly 
2 25 1 (12) 
—Dyty Ag ly + Ay ty Dy hy) 


i 
Wr) lat) — (Dy Dy —AvAy ry ry thy) 

4 (13) 
+ (Dy Dy —Ay Ay (ry P+ Ly by) 


= i 
\y ald”) glOvDy —AyAr \ryry +hly) (14) 
— (Dy Dy — Ay Ax) (ty Tz + Lyly)| 


i 
WW )lx o> 4 (Dy Dy —AvAy ryry —hly) (15) 
— (Dy Dy — Ay Ax) (1272 — Lyly)| 


i 
Wo )olx p> 7 [((Dy Dy —Ay Ay (ry ry —lly) 
+ (Dy Dy — Ay Ay (ta Ty — brly)| 


where $D = (|0\rangle^s + |1\rangle^s)/\sqrt{2}$, $A = (|0\rangle^s - |1\rangle^s)/\sqrt{2}$, $r = |0\rangle^o$ and $l = |1\rangle^o$. Both output modes then pass through polarizers set at 45°. Only the output with the term $D_{1'}D_{2'}$ will result in there being one photon in each output mode, with an efficiency of 1/8. The other terms will be rejected by the conditional detection. According to equations (1)-(16), 4 of the 16 output states have the term $D_{1'}D_{2'}$, and the 4 corresponding input hyper-entangled Bell states are $|\phi^+\rangle_{12}|\omega^-\rangle_{12}$, $|\phi^-\rangle_{12}|\omega^+\rangle_{12}$, $|\phi^-\rangle_{12}|\chi^+\rangle_{12}$ and $|\phi^-\rangle_{12}|\chi^-\rangle_{12}$, which will be sent to the next stage of BSM on the OAM qubits.
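
The post-selection at the PBS can be checked with a short numerical sketch. It assumes the PBS rules stated in this section (the $|0\rangle^s$ component is transmitted with its OAM unchanged; the $|1\rangle^s$ component is reflected with a factor $i$ and a flipped OAM) and ideal 45° polarizers. The function and dictionary names are illustrative and do not come from the paper.

```python
from itertools import product

S2 = 2 ** -0.5
H, V = 0, 1  # SAM basis states |0>^s and |1>^s


def bell(kind, sign):
    """Two-qubit Bell state as {(a, b): amplitude}."""
    pairs = [(0, 0), (1, 1)] if kind == "same" else [(0, 1), (1, 0)]
    return {pairs[0]: S2, pairs[1]: sign * S2}


SAM = {"phi+": bell("same", 1), "phi-": bell("same", -1),
       "psi+": bell("diff", 1), "psi-": bell("diff", -1)}
OAM = {"omega+": bell("same", 1), "omega-": bell("same", -1),
       "chi+": bell("diff", 1), "chi-": bell("diff", -1)}


def dd_coincidence_probability(sam, oam):
    """Probability of one D-polarized photon in each PBS output."""
    amp = {}  # key (a, b): OAM a in output 1', OAM b in output 2'
    for (p1, p2), cs in sam.items():
        for (o1, o2), co in oam.items():
            c = cs * co
            if p1 == H and p2 == H:
                # both transmitted: 1 -> 1', 2 -> 2', OAM unchanged
                key, a = (o1, o2), c
            elif p1 == V and p2 == V:
                # both reflected: 1 -> 2', 2 -> 1', factor i each, OAM flipped
                key, a = (1 - o2, 1 - o1), (1j * 1j) * c
            else:
                continue  # HV or VH: both photons leave through the same port
            # project each photon onto D: <D|H> = <D|V> = 1/sqrt(2)
            amp[key] = amp.get(key, 0) + a * S2 * S2
    return sum(abs(a) ** 2 for a in amp.values())


passing = {}
for s, o in product(SAM, OAM):
    p = dd_coincidence_probability(SAM[s], OAM[o])
    if p > 1e-12:
        passing[(s, o)] = p

# The 16 hyper-entangled Bell states are equally likely.
efficiency = sum(passing.values()) / 16
print(sorted(passing), round(efficiency, 6))
```

Under these assumptions four input states survive, each with probability 1/2, which gives the stated efficiency of 4/16 × 1/2 = 1/8.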

BSM and teleportation of OAM qubits. Having measured and filtered out the 
SAM qubit, next we perform BSM on the OAM qubit. Dealing with a single degree 
of freedom is more straightforward, and can be implemented using a beam splitter. 
Like the PBS, the beam splitter also is not OAM-preserving, because the reflection 
at the beam splitter will flip the sign of OAM. Therefore, the transformation rules 
at a beam splitter for the four OAM Bell states are 
We can see that only the Bell state $|\omega^-\rangle_{12}$ will result in there being one and only one photon in each output, whereas for the three other Bell states the two input photons will coalesce into a single output mode. Among these three Bell states, the state $|\omega^+\rangle_{12}$ can be further distinguished by measuring two single photons in either output with the OAM orthogonal basis $(|0\rangle^o, |1\rangle^o)$. Thus, this would allow us to unambiguously discriminate two from the four Bell states. Experimentally, we design dual-channel OAM readout devices to measure $|0\rangle^o$ and $|1\rangle^o$ simultaneously. Therefore, the efficiency of both QND and BSM on OAM is 1/2. The overall efficiency of h-BSM combining all three steps is 1/8 × 1/2 × 1/2 = 1/32.
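
The beam-splitter step can be illustrated in the same spirit. The sketch below assumes a symmetric 50:50 beam splitter with transmission amplitude $1/\sqrt{2}$ and reflection amplitude $i/\sqrt{2}$ (a common convention, chosen here for illustration), with reflection flipping the OAM qubit. It computes the probability of finding one photon in each output for the four OAM Bell states and then multiplies the three stage efficiencies quoted in the text.

```python
from fractions import Fraction

S2 = 2 ** -0.5

OAM_BELL = {
    "omega+": {(0, 0): S2, (1, 1): S2},
    "omega-": {(0, 0): S2, (1, 1): -S2},
    "chi+": {(0, 1): S2, (1, 0): S2},
    "chi-": {(0, 1): S2, (1, 0): -S2},
}


def coincidence_probability(state):
    """Probability of one photon per output port of the beam splitter."""
    t, r = S2, 1j * S2  # assumed transmission and reflection amplitudes
    amp = {}            # key (a, b): OAM a in output 1', OAM b in output 2'
    for (o1, o2), c in state.items():
        # both photons transmitted: OAM unchanged
        amp[(o1, o2)] = amp.get((o1, o2), 0) + t * t * c
        # both photons reflected: outputs swapped and OAM flipped
        key = (1 - o2, 1 - o1)
        amp[key] = amp.get(key, 0) + r * r * c
    return sum(abs(a) ** 2 for a in amp.values())


probs = {name: round(coincidence_probability(s), 9) for name, s in OAM_BELL.items()}
print(probs)

# Stage efficiencies quoted in the text: SAM BSM, QND, OAM BSM.
overall = Fraction(1, 8) * Fraction(1, 2) * Fraction(1, 2)
print(overall)
```

Only `omega-` yields a coincidence in this model, and the product of the stage efficiencies is 1/32.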
Generating three photon pairs. Ultrafast laser pulses with an average power of 
800 mW, central wavelength of 394 nm, pulse duration of 120 fs and repetition rate 
of 76 MHz successively pass through three β-barium borate (BBO) crystals (Fig. 2)
to generate three photon pairs through type-I spontaneous parametric down- 
conversion (SPDC). The first and the third crystals are 2 mm-thick BBOs, whereas 
the second one consists of two 0.6 mm-thick contiguous type-I BBO crystals with 
optic axes aligned in perpendicular planes. All of the down-converted photons 
have central wavelengths of 788 nm. The first photon pair (1-t) is initially prepared 
in the zero-order OAM mode with a coincidence count rate of $3.6 \times 10^{5}$ Hz. In
SPDC, the zero-order OAM mode has the highest weight among all OAM modes 
and thus has the highest brightness. The second photon pair (2-3) is simulta- 
neously entangled in both the SAM and the OAM (in the first-order mode), owing 
to OAM conservation in the type-I SPDC. It has a count rate of $3.4 \times 10^{4}$ Hz and a
state fidelity of 0.95. The third photon pair (4-5) is created in the first-order OAM 
entangled state with a two-photon count rate of $1.2 \times 10^{5}$ Hz and a fidelity of 0.91.
We estimate the mean numbers of photon pairs generated per pulse as ~0.1, ~0.01
and ~0.05 for the first, second and third pairs, respectively. 
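
A few derived quantities follow directly from the pump parameters above. The sketch below is back-of-envelope arithmetic only. It treats the quoted mean pair numbers as independent per-pulse probabilities (reasonable only for small means) and ignores all collection and detection losses, so the last figure is an upper bound on generation, not a predicted count rate. The variable names are illustrative.

```python
pump_power_w = 0.800        # average pump power (800 mW)
rep_rate_hz = 76e6          # repetition rate (76 MHz)
pump_wavelength_nm = 394    # central pump wavelength
mean_pairs = (0.1, 0.01, 0.05)  # first, second, third pair per pulse

# Energy carried by a single pump pulse, in nanojoules.
pulse_energy_nj = pump_power_w / rep_rate_hz * 1e9

# Degenerate down-conversion splits the pump energy equally,
# so each daughter photon has twice the pump wavelength.
spdc_wavelength_nm = 2 * pump_wavelength_nm

# Probability that all three sources fire on the same pulse.
p_all_three = mean_pairs[0] * mean_pairs[1] * mean_pairs[2]
six_photon_generation_rate_hz = rep_rate_hz * p_all_three

print(round(pulse_energy_nj, 2), spdc_wavelength_nm,
      round(six_photon_generation_rate_hz))
```

The doubled wavelength reproduces the 788 nm quoted for the down-converted photons.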

Preparing spin-orbit entanglement. As illustrated in the yellow panel of Fig. 2, we use a Sagnac interferometer to prepare the spin-orbit hybrid entangled state to be teleported, $|\varphi\rangle_E = (|0\rangle^s|0\rangle^o + |1\rangle^s|1\rangle^o)/\sqrt{2}$. Photon 1 is initially prepared in the $(|0\rangle^s - |1\rangle^s)/\sqrt{2}$ SAM state with zero-order OAM. It is then sent into a Sagnac interferometer that consists of a PBS and an SPP, where the counter-propagating $|0\rangle^s$ and $|1\rangle^s$ SAMs with zero-order OAM pass through the SPP in opposite directions and are converted into $|0\rangle^s|0\rangle^o$ and $|1\rangle^s|1\rangle^o$, respectively. Considering an extra $\pi$ phase shift for the $|1\rangle^s$ SAM from double reflections in the PBS, the final output state is $(|0\rangle^s|0\rangle^o + |1\rangle^s|1\rangle^o)/\sqrt{2}$. We note that the whole conversion process is deterministic.

Two-photon interference on the PBS and beam splitters. For a test of two- 
photon interference visibility at the PBS, input photons 1 and 2 are first intentionally 
prepared in the states $D_1 r_1 A_2 l_2$ (orthogonal SAMs; see open squares in Extended Data Fig. 1a) and $D_1 r_1 D_2 l_2$ (parallel SAMs; see solid circles in Extended Data Fig. 1a), respectively. We measure each single photon from the two outputs of the PBS in the $D_{1'}D_{2'}$ basis (1' and 2' denote the output spatial modes). At zero delay, where the photons are optimally overlapped in time, the orthogonal SAM input yields an output state of $(|0\rangle^s_{1'}|0\rangle^s_{2'} + |1\rangle^s_{1'}|1\rangle^s_{2'})/\sqrt{2}$ conditioned on a coincidence detection, which can be decomposed in the diagonal basis as $(|D\rangle^s_{1'}|D\rangle^s_{2'} + |A\rangle^s_{1'}|A\rangle^s_{2'})/\sqrt{2}$, thus showing an enhancement. For the parallel SAM input, the output state is $(|D\rangle^s_{1'}|A\rangle^s_{2'} + |A\rangle^s_{1'}|D\rangle^s_{2'})/\sqrt{2}$, which shows a reduction. The increase in the delay
gradually destroys the indistinguishability of the two photons, such that their 
quantum state becomes a classical mixture. Thus, at large delays the counts appear 
flat. Interferometers of this type are sensitive only to length changes of the order of 
the coherence length of the detected photons and stay stable for weeks. 

The two-photon interferences on beam splitters 1 and 2 are for the teleportation-based QND measurement of OAM qubits and the BSM of OAM qubits, respectively.
As a test, the two input photons are prepared in the same SAM but orthogonal 
OAM states. It is interesting to note that, in stark contrast to the conventional 
Hong-Ou-Mandel interference, only having the two input OAM states ortho- 
gonal can lead to an interference dip, because the reflection at the beam splitter flips 
the sign of OAM. The two-photon interference as a function of temporal delay is 
shown in Extended Data Fig. 1b, c. 
Dual-channel and efficient OAM measurement. One of the most frequently used OAM measurement devices in previous experiments is the off-axis hologram grating, which typically has a practical efficiency (p) of about 30% (ref. 31). This low efficiency would cause an extremely low six-photon coincidence count rate that scales as $p^6$, which is more challenging than the $p^2$ scaling in the previous two-photon OAM experiments. To overcome this challenge, we use two different types of device for efficient OAM readout.
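
The scaling argument can be made concrete with a small calculation. The 30% grating efficiency is taken from the text. The comparison with four ~97% readout channels is only an illustrative assumption about how many grating-based readouts might be replaced, not an accounting of the actual set-up.

```python
p_grating = 0.30  # practical hologram-grating efficiency quoted in the text

two_photon_factor = p_grating ** 2   # scaling in two-photon OAM experiments
six_photon_factor = p_grating ** 6   # scaling if every photon needed a grating

# How much harder the six-photon case is than the two-photon case (= p**-4).
penalty = two_photon_factor / six_photon_factor

# Hypothetical gain from replacing n grating readouts with ~97% devices.
p_efficient = 0.97
n_replaced = 4  # assumption for illustration only
gain = (p_efficient / p_grating) ** n_replaced

print(round(six_photon_factor, 6), round(penalty, 1), round(gain, 1))
```

Under this assumption the gain is roughly a factor of 100, the same order as the improvement of more than two orders of magnitude reported for the high-efficiency devices.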

The first type is what we refer to as ‘dual-channel’ OAM measurement devices, 
used after the two beam splitters. The strategy is to transfer the OAM information 
to the photon’s SAM, and measure it using a PBS with two output channels. The 
method is as follows. After the beam splitter, each photon passes through a HWP and is prepared in the state $(|0\rangle^s + |1\rangle^s)/\sqrt{2}$. They are then sent into Sagnac interferometers with a Dove prism inside (ref. 32; Fig. 2), which is placed at a $\pi/8$ angle with respect to the interferometer plane. The photon is rotated by an angle of $\pi/4$ for the $|0\rangle^s$ SAM component when passing through the Dove prism forwards and by an angle of $-\pi/4$ for the $|1\rangle^s$ SAM component when passing through the Dove prism backwards. OAM states $|0\rangle^o$ and $|1\rangle^o$ with respective phases $e^{i\phi}$ and $e^{-i\phi}$, where $\phi$ is the azimuthal angle in the polar coordinate system, which pass through a Dove prism rotated by $\pi/8$, will be rotated by an angle of $\pi/4$ and their phases will change to $e^{i(\phi - \pi/4)}$ and $e^{-i(\phi - \pi/4)}$, respectively. Therefore, two opposite phases, $e^{-i\pi/4}$ and $e^{i\pi/4}$, will be added to the two orthogonal OAM modes. Finally, the output photons are transformed into the $|0\rangle^s$ and $|1\rangle^s$ polarizations using a QWP. The overall transformations can be summarized as a CNOT gate between the OAM and SAM:
Effectively, the OAM qubit is deterministically and redundantly encoded by the
SAM qubit, that is, the SAM becomes identical to the OAM state. Thus, by mea- 
suring the SAM using a PBS with two output channels, we can recover information 
about the OAM with a high efficiency of ~97%. In our experiment, four such 
Sagnac interferometers are used in the h-BSM. In this way, two of the four OAM 
Bell states can be discriminated using a beam splitter. 
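
The OAM-to-SAM transfer in the Dove-prism Sagnac loop can be sketched with Jones vectors. The example assumes that the two OAM qubit states carry topological charges +1 and -1 and that the two SAM components acquire the opposite geometric phases described above. The sign conventions and function names are illustrative assumptions.

```python
import cmath
import math

S2 = 2 ** -0.5


def sam_after_sagnac(charge):
    """SAM Jones vector (H, V) after the loop for a given OAM charge."""
    # H component: beam rotated by +pi/4, so exp(i*l*phi) gains exp(-i*l*pi/4)
    # V component: beam rotated by -pi/4, so it gains exp(+i*l*pi/4)
    return (S2 * cmath.exp(-1j * charge * math.pi / 4),
            S2 * cmath.exp(+1j * charge * math.pi / 4))


def inner(u, v):
    """Complex inner product <u|v>."""
    return sum(a.conjugate() * b for a, b in zip(u, v))


plus, minus = sam_after_sagnac(+1), sam_after_sagnac(-1)

overlap = abs(inner(plus, minus))                    # 0 means orthogonal
rel_phase_plus = cmath.phase(plus[1] / plus[0])      # phase of V relative to H
rel_phase_minus = cmath.phase(minus[1] / minus[0])

print(round(overlap, 12), rel_phase_plus / math.pi, rel_phase_minus / math.pi)
```

The two outputs are the circular polarizations $(H \pm iV)/\sqrt{2}$, which are orthogonal, so a QWP followed by a PBS can separate them into two channels.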

Whereas the first type is like a two-channel readout device (like a PBS for SAM), 
the second type is like a one-channel readout device (like a polarizer for SAM), and 
is used in the final stage of state verification after teleportation. The strategy for the 
projective OAM measurement is to transform it into the zero-order OAM mode so 
that the photon can be coupled into a single-mode fibre, while all other higher-order OAM modes will be rejected. Here we use an SPP (ref. 33) and a BPP (ref. 34) for efficient
OAM readout. 

The SPP (ref. 33) is designed with a spiral shape to create a vortex phase of $e^{il\phi}$, where $l$ is an integer and is referred to as topological charge. The SPP can attach vortex phases of $e^{il\phi}$ and $e^{-il\phi}$ to a photon when it passes through the SPP forwards and, respectively, backwards. When vortex phases of $e^{il\phi}$ and $e^{-il\phi}$ are attached, the corresponding OAM values will increase by $l\hbar$ and, respectively, decrease by $l\hbar$. Therefore, an SPP can be used as an OAM mode converter. In our experiment, the OAM qubits are encoded in the OAM first-order subspace with the topological charge $l$ being 1. The conversion between the OAM zero mode and the first-order modes ($|0\rangle^o$ and $|1\rangle^o$) is realized using a 16-phase-level SPP with an efficiency of
~97%. 

For the coherent transformation between the OAM zero mode and the superposition states $(|0\rangle^o \pm |1\rangle^o)/\sqrt{2}$ and $(|0\rangle^o \pm i|1\rangle^o)/\sqrt{2}$, a 2-phase-level BPP (ref. 34) is used with an efficiency of ~80%. These high-efficiency OAM measurement devices
boost the sixfold coincidence count rate in our present experiment by more than 
two orders of magnitude, compared with the previous use of hologram gratings. 
Error budget. The sources of error in our experiment include double pair emis- 
sion in spontaneous parametric down-conversion; partial distinguishability of the 
independent photons that interfere at the PBS (~5%) and the beam splitters 
(~5%); state measurement error due to zero-order OAM leakage (~2%); fidelity 
imperfection of entangled photon pairs 2-3 (~5%) and 4-5 (~9%); and fidelity 
imperfection of the to-be-teleported single-photon hybrid entangled state (~8%). 

Some error sources affect all the teleported states. First, the double pair emission 
contributed a ~15% background to the overall sixfold coincidence counts. If this 
were subtracted, the average teleportation fidelity would be improved to ~0.74. 
Second, the imperfectly entangled photon pair 4-5 and the imperfect two-photon 
interference at beam splitters 1 and 2 (for the QND and OAM Bell-state measurements, respectively) degrade the teleportation fidelity for all states by ~13%. Third,
the imperfect state measurements mainly due to the zero-order OAM leakage 
cause a degradation of ~2%. 

Some error sources can have different effects on different teleported states. The 
imperfect two-photon interference at the PBS degrades the teleportation fidelities 
for the states $|\varphi\rangle_C$, $|\varphi\rangle_D$ and $|\varphi\rangle_E$ by ~5%. However, for the states $|\varphi\rangle_A$ and $|\varphi\rangle_B$, where the photon is horizontally or vertically polarized, the actual teleportation does not require two-photon interference at the PBS, and is therefore immune to the imperfection of the interference. This explains why the teleportation fidelities for the states $|\varphi\rangle_A$ and $|\varphi\rangle_B$ are the highest. This is consistent with the previous results: for the experiments using a PBS (refs 35–37), the teleportation fidelities in the horizontal-vertical basis are higher than those in the ±45° linear and circular polarization bases; whereas for the experiments using a non-polarizing beam splitter (refs 38, 39), the teleportation fidelities in all polarization bases are largely unbiased.

We note that the teleportation fidelity for the hybrid entangled state, $|\varphi\rangle_E$, is the lowest, which is affected by the imperfection (~8%) in the state preparation of the initial state $|\varphi\rangle_E$ as well as by imperfections in $|\varphi\rangle_C$ and $|\varphi\rangle_D$, that is, essentially all error and noise present in the experiment. It can be expected that all entangled states are subject to the same decoherence mechanism as the state $|\varphi\rangle_E$, as demonstrated, and should undergo similar reductions in fidelity.

These sources of noise can in principle be eliminated in future by various methods. For instance, deterministic entangled photons (ref. 40) do not suffer the problem of
double pair emission. We also plan to develop bright OAM-entangled photons 
with higher fidelity, and a more precise 32-phase-level SPP for the next experiment 
of hyper-entanglement swapping. 
A universal scheme for teleporting N DoFs. We illustrate in Extended Data Fig. 2
a universal scheme for teleporting N DoFs of a single photon. For simplicity, we 
discuss an example for three DoFs (Extended Data Fig. 2b), labelled X, Y and Z. 
There are in total 64 hyper-entangled Bell states: 
These are the products of the Bell states of each DoF, defined as
$|\phi^{\pm}\rangle_i = (|0\rangle_i|0\rangle_i \pm |1\rangle_i|1\rangle_i)/\sqrt{2}$
$|\psi^{\pm}\rangle_i = (|0\rangle_i|1\rangle_i \pm |1\rangle_i|0\rangle_i)/\sqrt{2}$


where i = X, Y, Z. The aim is to perform an h-BSM, identifying 1 out of the 64 
states. The required resources include photon pairs entangled in the Z DoF, pairs 
hyper-entangled in the Y-Z DoFs, pairs hyper-entangled in the X-Y-Z DoFs, 
filters for the three DoFs that can project the state to |0) or |1) (with a functionality 
similar to the polarizers for SAM), qubit flip ($\sigma_x$) operations for the DoFs, a 50:50
non-polarizing beam splitter and single-photon detectors, all of which are com- 
mercially available or have been experimentally demonstrated previously. 

It has been known that if two single photons are superposed at a beam splitter, 
only asymmetric quantum states can result in one and only one photon exiting 
from each output of the splitter. In the simplest case of one DoF, this is the asymmetric $|\psi^-\rangle$ state. As now we have three DoFs, we have to consider the combined,
molecular-like quantum states. There are in total 28 possible combinations that are 
asymmetric states, as follows: 
After the photons have passed through the first beam splitter, we can filter out and retain these 28 from the 64 hyper-entangled Bell states conditioned on seeing one and only one photon in each output of the splitter. Next we apply two filters in the two outputs of the splitter to project the X DoF into $|\psi^{\pm}\rangle_X$. One filter is set to pass the $|1\rangle$ state and the other is set to pass the $|0\rangle$ state. This results in the following 16 states being filtered from the 28:
We perform a bit-flip operation on the X DoF on one of the arms of the interferometer, erasing the information in the X DoF. We then pass the photons into the
second beam splitter, filtering out and retaining the six asymmetric combinations 
We emphasize that, before sending the photons into the second beam splitter, for
the reason discussed in the main text we use teleportation-based QND measure- 
ment to ensure the two photons can be fed into the subsequent cascaded inter- 
ferometers. Here the QND should preserve the quantum information in the Y and 
Z DoFs. Thus, quantum teleportation of two DoFs of a single photon is required 
(Extended Data Fig. 2a), which is exactly what we demonstrated in the experiment 
presented in the main text. 

After that, we again pass the photons through two filters on the Y DoF, one set in 
the |1) state and the other set in the |0) state, which leaves four asymmetric 
combinations: 
Similarly, we perform a bit-flip operation on the Y DoF on one of the arms, erasing
the information in the Y DoF. We then do a QND measurement on the Z DoF, and
pass the two photons into the third beam splitter, finally filtering out the only 
remaining asymmetric state: 


$|\psi^+\rangle_X \otimes |\psi^+\rangle_Y \otimes |\psi^-\rangle_Z$

By detecting one and only one photon in each output of the third splitter, we can thus discriminate this particular hyper-entangled state, $|\psi^+\rangle_X \otimes |\psi^+\rangle_Y \otimes |\psi^-\rangle_Z$, from the 64 hyper-entangled Bell states on three DoFs.
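
The state counts quoted in this scheme (64, 28, 16, 6, 4 and 1) can be reproduced by simple enumeration. The sketch below uses two modelling assumptions: a product of Bell states is exchange-antisymmetric exactly when it contains an odd number of $|\psi^-\rangle$ factors, and once a DoF has been filtered and bit-flipped it is exchange-symmetric and drops out of later symmetry checks. All names are illustrative.

```python
from itertools import product

BELL = ("phi+", "phi-", "psi+", "psi-")


def antisymmetric(factors):
    """True if the product state is odd under photon exchange."""
    return sum(f == "psi-" for f in factors) % 2 == 1


def passes_filters(factor):
    """A |1> filter in one output and a |0> filter in the other keep psi+/-."""
    return factor.startswith("psi")


all_states = list(product(BELL, repeat=3))                       # (X, Y, Z)
after_bs1 = [s for s in all_states if antisymmetric(s)]          # first splitter
after_fx = [s for s in after_bs1 if passes_filters(s[0])]        # X filters
after_bs2 = [s for s in after_fx if antisymmetric(s[1:])]        # second splitter
after_fy = [s for s in after_bs2 if passes_filters(s[1])]        # Y filters
after_bs3 = [s for s in after_fy if antisymmetric(s[2:])]        # third splitter

counts = [len(x) for x in
          (all_states, after_bs1, after_fx, after_bs2, after_fy, after_bs3)]
print(counts, after_bs3)
```

Within this model a single state survives all three stages, matching the count of one given in the text.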

To experimentally demonstrate the teleportation of three DoFs, the scheme would need in total ten photons (or five entangled photon pairs from SPDC),
which is within the reach of near-future experimental abilities, given the recent 
advances in high-efficiency photon collection and detection. It is obvious that 
the above protocol can be extended to more DoFs as displayed in Extended Data 
Fig. 2c. 
Feed-forward scheme for spin-orbit composite states. To realize a deterministic 
teleportation, feed-forward Pauli operations on the teleported particle based on 
the intrinsically random Bell-state measurement results are essential. In our pre- 
sent experiment, no feed-forward has been applied. Here we briefly describe how 
this can be done for the spin-orbit composite state. For the SAM qubits, active 
feed-forward has been demonstrated before using fast electro-optical modulators (EOMs; ref. 41). To take advantage of this technology, which has been demonstrated to have high-speed operation and high gate fidelity, we use a coherent
quantum SWAP gate between the OAM and SAM qubits. The SWAP gate is 
defined as 
The operation sequence for the feed-forward operation on the spin-orbit composite states is shown in Extended Data Fig. 3a. First, the SAM feed-forward is done with an EOM. Second, the SAM and OAM qubits undergo a SWAP operation. Third, an EOM is used to operate on the 'new' SAM that is converted from the
OAM. We note that the EOM does not affect the OAM, and so the previous 
operation on the SAM is unaffected. Lastly, a final SWAP gate converts the SAM 
back to OAM, which completes the feed-forward for spin-orbit composite states. 

The SWAP gate is composed of three CNOT gates (Extended Data Fig. 3b). In the first and third CNOT gates (blue shading), the SAM qubit is the control qubit that acts on the OAM as target qubit, realized by sending a photon through, and recombining it at, a PBS. On the reflection arm only, a Dove prism is inserted to induce an OAM bit flip. In the second CNOT (yellow shading), the control and target qubits are respectively the OAM and SAM qubits, as explained in the Methods section 'Dual-channel and efficient OAM measurement'. The extra
Dove prism and HWP in front of the PBS are used to compensate for the phase 
shift inside the Sagnac interferometer. 
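
The statement that three alternating CNOT gates make up a SWAP gate is a standard circuit identity and can be verified with 4 × 4 matrices. The sketch below uses the basis ordering |SAM, OAM⟩ = |00⟩, |01⟩, |10⟩, |11⟩, which is a choice made for this illustration.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# SAM is the control, OAM the target: |10> <-> |11>
CNOT_SAM_CONTROL = [[1, 0, 0, 0],
                    [0, 1, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0]]

# OAM is the control, SAM the target: |01> <-> |11>
CNOT_OAM_CONTROL = [[1, 0, 0, 0],
                    [0, 0, 0, 1],
                    [0, 0, 1, 0],
                    [0, 1, 0, 0]]

# SWAP exchanges the two qubits: |01> <-> |10>
SWAP = [[1, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 0, 1]]

# First and third gates use the SAM as control; the middle gate is reversed.
product3 = matmul(CNOT_SAM_CONTROL, matmul(CNOT_OAM_CONTROL, CNOT_SAM_CONTROL))
print(product3 == SWAP)
```

The identity holds only when the middle gate has its control and target reversed relative to the outer two, which is the arrangement described for Extended Data Fig. 3b.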


31. Graham, T. M., Barreiro, J. T., Mohseni, M. & Kwiat, P. G. Hyperentanglement- 
enabled direct characterization of quantum dynamics. Phys. Rev. Lett. 110, 
060404 (2013). 

32. Slussarenko, S. et al. The polarizing Sagnac interferometer: a tool for light orbital 
angular momentum sorting and spin-orbit photon processing. Opt. Express 18, 
27205-27216 (2010). 

33. Andrews, D. L. & Babiker, M. (eds) The Angular Momentum of Light (Cambridge 
Univ. Press, 2012). 


34. Jack, B. et al. Precise quantum tomography of photon pairs with entangled orbital
angular momentum. New J. Phys. 11, 103024 (2009).

35. Pan, J.-W., Gasparoni, S., Aspelmeyer, M., Jennewein, T. & Zeilinger, A. Experimental
realization of freely propagating teleported qubits. Nature 421, 721-725 (2003).

36. Zhang, Q. et al. Experimental quantum teleportation of a two-qubit composite
system. Nature Phys. 2, 678-682 (2006).

37. Yin, J. et al. Quantum teleportation and entanglement distribution over 100-
kilometre free-space channels. Nature 488, 185-188 (2012).

38. Bouwmeester, D. et al. Experimental quantum teleportation. Nature 390, 575-579
(1997).

39. Ma, X.-S. et al. Quantum teleportation over 143 kilometres using active feed-
forward. Nature 489, 269-273 (2012).

40. Lu, C.-Y. & Pan, J.-W. Push-button photon entanglement. Nature Photon. 8,
174-176 (2014).

41. Prevedel, R. et al. High-speed linear optics quantum computing using active feed-
forward. Nature 445, 65-69 (2007).

Extended Data Figure 1 | Hong-Ou-Mandel interference of multiple independent photons encoded with SAM or OAM. In all panels the x axis is the delay (µm) and the y axis is the fourfold coincidence count (per 300 s in a, per 20 s in b and c). a, Interference at the PBS where input photons 1 and 2 are intentionally prepared in the states $D_1 r_1 A_2 l_2$ (orthogonal SAMs; open squares) and $D_1 r_1 D_2 l_2$ (parallel SAMs; solid circles). The y axis shows the raw fourfold (the trigger photon t and photons 1, 2 and 3) coincidence counts. The extracted visibility is 0.75 ± 0.03, calculated from $V(0) = (C_\perp - C_\parallel)/(C_\perp + C_\parallel)$, where $C_\parallel$ and $C_\perp$ are the coincidence counts without any background subtraction at zero delay for parallel and, respectively, orthogonal SAMs. The red and blue lines are Gaussian fits to the raw data. b, Two-photon interference on beam splitter 1, where photons 1 and 4 are prepared in orthogonal OAM states. The black line is a Gaussian fit to the raw data of fourfold (the trigger photon and photons 1, 4 and 5) coincidence counts. The visibility is 0.73 ± 0.03, calculated from $V(0) = 1 - C_0/C_\infty$, where $C_0$ and $C_\infty$ are the fitted counts at zero and, respectively, infinite delays. c, Two-photon interference at beam splitter 2, where input photons 1 and 5 are prepared in the orthogonal OAM states. The black line is a Gaussian fit to the data points. The interference visibility is 0.69 ± 0.03, calculated in the same way as in b. Error bars, 1 s.d., calculated from Poissonian counting statistics of the raw detection events.
Extended Data Figure 2 | A universal scheme for teleporting N DoFs of a single photon. a, A scheme for teleporting two DoFs of a single photon using three beam splitters, which is slightly different from the one presented in the main text using a PBS and two beam splitters. Through the first beam splitter, six asymmetric states, $|\psi^-\rangle_X \otimes \{|\phi^+\rangle_Y, |\phi^-\rangle_Y, |\psi^+\rangle_Y\}$ and $\{|\phi^+\rangle_X, |\phi^-\rangle_X, |\psi^+\rangle_X\} \otimes |\psi^-\rangle_Y$, can result in one photon in each output, which is ensured by teleportation-based QND on the Y DoF. After passing the two photons through the two filters that project them into the $|1\rangle$ and $|0\rangle$ states for the X DoF, four states, $|\psi^-\rangle_X \otimes \{|\phi^+\rangle_Y, |\phi^-\rangle_Y, |\psi^+\rangle_Y\}$ and $|\psi^+\rangle_X \otimes |\psi^-\rangle_Y$, survive. Through the second beam splitter, only the asymmetric state $|\psi^-\rangle$ of the Y DoF can result in one photon in each output. Finally we can discriminate the state $|\psi^+\rangle_X \otimes |\psi^-\rangle_Y$ from the 16 hyper-entangled Bell states. b, Teleportation of three DoFs of a single photon (Methods). Note that to ensure that there is one and only one photon in the output of the first beam splitter, we can use the teleportation-based QND on two DoFs in a (dashed circle). c, Generalized teleportation of N DoFs of a single photon. The h-BSM on N DoFs can be implemented as follows: (1) the beam splitter post-selects the asymmetric hyper-entangled Bell states in N DoFs which contain an odd number of asymmetric Bell states in one DoF, (2) two filters and one bit-flip operation erase the information on the measured DoF and further post-select asymmetric states, and (3) teleportation-based QND.
Extended Data Figure 3 | Active feed-forward for spin-orbit composite states. a, The active feed-forward scheme. This composite active feed-forward could be completed in a step-by-step manner. First, we use an EOM to implement the active feed-forward for SAM qubits. It is important to note that the EOM does not affect the OAM. Second, we use a coherent quantum SWAP gate between the OAM and SAM qubits. The original OAM is converted into a 'new' SAM, whose active feed-forward operation is done by a second EOM. Then the OAM and SAM qubits undergo a second SWAP operation and are converted to the original DoFs. b, The quantum circuit for a SWAP gate between the OAM and SAM qubits. The SWAP gate is composed of three CNOT gates: in the first and third CNOT gates, the SAM and OAM qubits act as the control and target qubits, respectively, whereas in the second CNOT gate this is reversed.
Dynamically reconfigurable complex emulsions via
tunable interfacial tensions 


Lauren D. Zarzar1, Vishnu Sresht2, Ellen M. Sletten1, Julia A. Kalow1, Daniel Blankschtein2 & Timothy M. Swager1


Emulsification is a powerful, well-known technique for mixing and 
dispersing immiscible components within a continuous liquid phase. 
Consequently, emulsions are central components of medicine, food 
and performance materials. Complex emulsions, including Janus 
droplets (that is, droplets with faces of differing chemistries) and 
multiple emulsions, are of increasing importance in pharmaceuticals and medical diagnostics, in the fabrication of microparticles and capsules for food, in chemical separations, in cosmetics, and in dynamic optics. Because complex emulsion properties and functions are related to the droplet geometry and composition, the development of rapid, simple fabrication approaches allowing precise control over the droplets' physical and chemical characteristics is critical. Significant advances in the fabrication of complex emulsions
ical. Significant advances in the fabrication of complex emulsions 
have been made using a number of procedures, ranging from large- 
scale, less precise techniques that give compositional heterogeneity 
using high-shear mixers and membranes, to small-volume but more
precise microfluidic methods. However, such approaches have yet
to create droplet morphologies that can be controllably altered after 
emulsification. Reconfigurable complex liquids potentially have great 
utility as dynamically tunable materials. Here we describe an approach 
to the one-step fabrication of three- and four-phase complex emul- 
sions with highly controllable and reconfigurable morphologies. 
The fabrication makes use of the temperature-sensitive miscibility 
of hydrocarbon, silicone and fluorocarbon liquids, and is applied 
to both the microfluidic and the scalable batch production of com- 
plex droplets. We demonstrate that droplet geometries can be alter- 
nated between encapsulated and Janus configurations by varying 
the interfacial tensions using hydrocarbon and fluorinated surfac- 
tants including stimuli-responsive and cleavable surfactants. This 
yields a generalizable strategy for the fabrication of multiphase emul- 
sions with controllably reconfigurable morphologies and the poten- 
tial to create a wide range of responsive materials. 

Phase separation approaches using mass transfer of a co-solvent or 
separating agent have attracted interest as simplified routes to the
fabrication of complex emulsions. In designing our new method, we 
use the facts that fluorocarbons are lipophobic as well as hydrophobic 
and that many fluorocarbon and hydrocarbon liquids are immiscible 
at room temperature but have a low upper consolute temperature (T.) 
and mix with gentle heating”. Hexane and perfluorohexane, for example, 
have a T, of 22.65 °C (ref. 18). Our interest in fluorocarbons and their 
emulsions was also based on the fact that they are inert materials with 
unique properties that have been exploited for use as magnetic resonance 
imaging and ultrasound contrast agents, as artificial blood, in water-repellent
surfaces, and in the acoustically triggered release of payloads.
To explore the feasibility of using a temperature-induced phase separation
route to complex emulsions (Fig. 1a), we emulsified a 1:1 volume
ratio of hexane and perfluorohexane above T_c in an aqueous solution
of Zonyl FS-300 (hereafter ‘Zonyl’), which is a nonionic fluorosurfactant
with the linear chemical formula F(CF₂)_xCH₂CH₂O(CH₂CH₂O)_yH.
Cooling below T_c induced phase separation and yielded structured complex
droplets (Fig. 1b). These complex emulsions were readily produced


in bulk by shaking warm hexane-perfluorohexane liquid in a surfac- 
tant solution (Fig. 1c). Although these droplets were polydisperse, the 
morphology and composition of the droplets were highly uniform. Chemical
partitioning during phase separation gave directed compartmentalization
of solutes (Fig. 1d). Therefore, temperature-induced phase
separation of liquids provides a simple, scalable approach to the fabri- 
cation of complex functional emulsions. 

The morphology of the complex droplets is exclusively controlled by 
interfacial tension. To put this in an analytical context, consider a complex
emulsion of any immiscible liquids F and H (at a given volume
ratio) in a third immiscible liquid W. We choose the interfacial tensions
of the H-W interface, γ_H, the F-W interface, γ_F, and the F-H interface,
γ_FH, such that γ_F and γ_H are significantly larger than γ_FH. This regime is
relevant to combinations of liquids H and F that have low interfacial
tension just below T_c. It can be shown that these multiphase droplets
are nearly spherical in shape and will adopt one of the following three
thermodynamically permissible internal configurations: (1) liquid H
completely encapsulates liquid F, (2) liquids H and F form a Janus
droplet, and (3) liquid F completely encapsulates liquid H (Fig. 2a).
These droplet configurations are characterized by two contact angles,
θ_H, between the H-W and F-H interfaces, and θ_F, between the F-W
and F-H interfaces. The three interfacial tensions acting along the


Figure 1 | Temperature-controlled phase separation of hydrocarbon and 
fluorocarbon liquids can be used to create complex emulsions. a, Complex 
emulsion fabrication. b, Above T_c, hexane and perfluorohexane are miscible
and emulsified in aqueous 0.1% Zonyl (top left). Below T_c, hexane and
perfluorohexane phases separate to create a hexane-in-perfluorohexane-in-water
(H/F/W) double emulsion (bottom right). Hexane is dyed red. Scale bar,
200 μm. c, Emulsions of uniform composition made by bulk emulsification
(such as shaking). Scale bar, 100 μm. d, Lateral confocal cross-section of H/F/W
double-emulsion droplets. Hydrocarbon-soluble Nile Red dye (green)
selectively extracts into hexane. Rhodamine B dyes the aqueous phase (red).
Scale bar, 100 μm. Monodisperse droplets in b and d were made using a
microcapillary device.

Figure 2 | Reconfiguration of droplet morphology is dynamic and results
from changes in the balance between interfacial tensions. a, Sketch of the
effect of interfacial tensions on the configuration of a complex droplet. In (1),
γ_F > γ_H + γ_FH, favouring encapsulation of phase F within phase H. In (3),
γ_H > γ_F + γ_FH and phase H is encapsulated within phase F. At intermediate
values of γ_F and γ_H, a Janus droplet with geometry typified by (2) is formed.
γ_H, γ_F and γ_FH can be reconfigured into a Neumann triangle solvable for θ_H
and θ_F. b, Hexane-perfluorohexane droplets reconfigure in response to
variation in the concentration of Zonyl as it diffuses through 0.1% SDS
from right (higher Zonyl concentration) to left (lower Zonyl concentration).
Scale bar, 100 μm. c, Configurational stability diagram for the


interfaces must be in equilibrium for the droplet configuration to be 
stable, as can be expressed by the following equations: 


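$$\cos\theta_H = \frac{\gamma_F^2 - \gamma_H^2 - \gamma_{FH}^2}{2\gamma_{FH}\gamma_H} \qquad (1)$$

$$\cos\theta_F = \frac{\gamma_H^2 - \gamma_F^2 - \gamma_{FH}^2}{2\gamma_{FH}\gamma_F} \qquad (2)$$
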
Figure 2a shows that configurations (1) and (3) are the limiting cases
of configuration (2) as θ_H → 0 and θ_F → 0, respectively. Equations (1)
and (2) can be used to translate these limiting contact angles into interfacial
tension conditions, yielding the following two relationships:

$$\theta_H = 0 \;\Rightarrow\; \gamma_F = \gamma_H + \gamma_{FH} \qquad (3)$$

$$\theta_F = 0 \;\Rightarrow\; \gamma_H = \gamma_F + \gamma_{FH} \qquad (4)$$

Recast as the difference between γ_H and γ_F, equations (3) and (4) indicate
that when γ_F − γ_H ≥ γ_FH the droplets assume configuration (1) in
Fig. 2a. Conversely, when γ_H − γ_F ≥ γ_FH the droplets adopt configuration
(3) in Fig. 2a. However, when the difference between γ_H and γ_F is
of the order of γ_FH, the droplets adopt a Janus droplet geometry associated
with configuration (2) in Fig. 2a.
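
As a rough illustration of equations (1)-(4), the sketch below classifies a droplet from its three interfacial tensions and evaluates the two contact angles. The function names and the example tensions are hypothetical choices made for illustration; they are not code or measurements from this work.

```python
import math


def droplet_configuration(gamma_h, gamma_f, gamma_fh):
    """Classify a three-phase droplet from its interfacial tensions (mN/m)."""
    if gamma_f - gamma_h >= gamma_fh:
        return "F/H/W"  # configuration (1): H completely encapsulates F
    if gamma_h - gamma_f >= gamma_fh:
        return "H/F/W"  # configuration (3): F completely encapsulates H
    return "Janus"      # configuration (2)


def contact_angles(gamma_h, gamma_f, gamma_fh):
    """Return (theta_H, theta_F) in degrees from equations (1) and (2).

    Outside the Janus regime the cosines fall outside [-1, 1]; they are
    clamped, so the limiting angles (0 or 180 degrees) are returned.
    """
    cos_h = (gamma_f**2 - gamma_h**2 - gamma_fh**2) / (2 * gamma_fh * gamma_h)
    cos_f = (gamma_h**2 - gamma_f**2 - gamma_fh**2) / (2 * gamma_fh * gamma_f)
    cos_h = max(-1.0, min(1.0, cos_h))
    cos_f = max(-1.0, min(1.0, cos_f))
    return math.degrees(math.acos(cos_h)), math.degrees(math.acos(cos_f))


# Hypothetical tensions: the difference (0.5) is smaller than gamma_FH (1.07).
print(droplet_configuration(5.0, 5.5, 1.07))
print(contact_angles(5.0, 5.5, 1.07))
```

Because the classification depends only on the difference between γ_F and γ_H relative to γ_FH, a small γ_FH makes the droplet sensitive to slight changes in either outer tension.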

These physical relationships reveal that, given a low value of γ_FH, only
slight changes in the balance of γ_H and γ_F are necessary to induce dramatic
changes in the droplet’s morphology. We proposed that if liquids
F, H and W were fluorocarbon, hydrocarbon and water, respectively,
dynamic reconfiguration of the droplets might be triggered by hydrocarbon
and fluorinated surfactants. Consistent with this, a 1:1 volume mixture
of hexane and perfluorohexane in 0.1% Zonyl generated hexane-in-perfluorohexane-in-water
(H/F/W) double emulsions (Fig. 1), indicating
a preferential decrease in γ_F. In comparison, emulsification of the
same hexane-perfluorohexane mixture in water containing 0.1% sodium 
dodecyl sulphate (SDS, an anionic hydrocarbon surfactant) yielded

hexane-perfluorohexane-water system showing γ_F − γ_H as a function of the
fraction of 0.1% SDS, f_SDS, where the other fraction is 0.1% Zonyl. The green
band denotes the region |γ_F − γ_H| < γ_FH = 1.07 ± 0.1 mN m⁻¹, obtained from
geometrical analysis of Janus droplets. The red dashed lines correspond to
γ_FH = 0.4 mN m⁻¹ as predicted for a temperature of 10 °C (ref. 23), the
approximate temperature at which the droplets in d were imaged. Filled and
unfilled triangles indicate conditions under which Janus droplets and double
emulsions were observed, respectively. Labels I-VII correspond to the
droplets in d. d, Optical micrographs of hexane-perfluorohexane droplets in
solutions of 0.1% Zonyl and 0.1% SDS in varying ratios as plotted in c. Hexane is
dyed and appears grey. Scale bars, 50 μm.


F/H/W double emulsions as a result of a preferentially decreased γ_H.
When we introduced a small volume of 10% Zonyl into F/H/W drop- 
lets in 0.1% SDS, we observed that the droplet morphology dynam- 
ically changed in accordance with the concentration gradient of Zonyl. 
Droplets first passed through a spherical Janus drop morphology before 
inverting to H/F/W double emulsions (Fig. 2b and Supplementary 
Video 1). The reverse was seen when concentrated SDS was added to 
0.1% Zonyl-stabilized droplets (Supplementary Video 2). These results 
suggest that not only is droplet morphology highly controllable on initial 
emulsification, by the choice of surfactants, but that these emulsions are 
also dynamically and reversibly reconfigurable by changing the balance
of γ_F and γ_H.

To validate the proposed dynamic mechanism, we measured the tensions
of hexane-water (γ_H) and perfluorohexane-water (γ_F) interfaces
for a variety of 0.1% SDS and 0.1% Zonyl ratios using the pendant-drop
method (Methods, Extended Data Fig. 1 and Extended Data Table 1).
The perfluorohexane-hexane interfacial tension, γ_FH, has been estimated
previously (ref. 23) and is in general agreement with our γ_FH estimates
based on geometrical analysis of the Janus drops produced (Methods,
Extended Data Fig. 2 and Extended Data Table 2). We can use the quantity
γ_F − γ_H as a simple indicator of droplet configuration, and this
quantity is plotted as a function of surfactant ratio in Fig. 2c. When only
0.1% SDS was used (f_SDS = 1), we measured γ_F − γ_H > γ_FH and observed
complete encapsulation of perfluorohexane by hexane. In a mixed SDS-Zonyl
composition, the trajectory entered the narrow green zone (Fig. 2c)
corresponding to |γ_F − γ_H| < γ_FH. Within this Janus droplet configurational
zone, the droplet morphology began to ‘flip’ as visualized in
Fig. 2d. As the proportion of 0.1% SDS approached zero and fluoro- 
surfactant dominated, the trajectory left the Janus zone and the droplets 
assumed a configuration in which hexane was encapsulated by per- 
fluorohexane. Overall, the observed droplet geometries closely followed 
the predicted morphological trend and confirm that variations in hydrocarbon
and fluorinated surfactants are effective for the manipulation
of the balance between γ_F and γ_H.

This understanding has made it possible for us to induce droplet
morphological transitions with stimuli-responsive and cleavable surfactants
(Fig. 3a). Responsive surfactants undergo reversible changes
in their effectiveness when triggered by stimuli such as magnetic fields,
pH, CO₂ or light. To create optically sensitive droplets, we synthesized
a light-responsive surfactant consisting of an azobenzene moiety that

Figure 3 | Emulsions reconfigure in response to light and pH. a, Sketch of
how the variations in γ_F and γ_H induced by alterations in the effectiveness of a
hydrocarbon surfactant translate into differences in drop morphology on a 
phase-stability diagram. Grey represents hexane and white represents 
perfluorohexane. b, Chemical structure of the light-responsive surfactant 
which reversibly isomerizes under ultraviolet (UV) and blue light between the 
more effective trans form of the surfactant (left) and the less effective cis form 
(right). Aligned beneath are optical micrographs of hexane-perfluorohexane 
emulsions that are tuned to undergo specific morphological transitions in 
response to light. Hexane is dyed red, and the aqueous phase consists of Zonyl 
and the light-responsive surfactant pictured. Top: droplets undergo complete 
inversion. Middle: F/H/W double-emulsion drops transition to Janus droplets. 
Most droplets are viewed from the top, but one is lying on its side allowing a 
view of the droplet profile. Bottom: Janus droplets transition to an H/F/W 
double emulsion. Scale bar, 100 μm. c, A pH-responsive surfactant,
N-dodecylpropane-1,3-diamine, is used in combination with Zonyl to create
pH-responsive droplets. Acid diffuses through the solution from right (higher
concentration) to left (lower concentration), reducing the pH below pK_a = 4.7
and thereby generating a less effective surfactant and inducing inversion of
the emulsion from F/H/W to H/F/W. Scale bar, 100 μm. d, Emulsions stabilized
by a combination of Zonyl and acid-cleavable surfactant, sodium 2,2-bis(hexyloxy)propyl
sulphate, undergo morphological changes as the
cleavable surfactant is degraded over time at pH = 3. Scale bar, 50 μm.


reversibly undergoes a photo-induced isomerization between a more 
effective trans configuration and a less effective cis configuration (Fig. 3b). 
When this surfactant was used in combination with Zonyl, we observed 
that the hexane-perfluorohexane droplets rapidly and reversibly changed 
morphology in response to ultraviolet (wavelength, λ = 365 nm) and
blue (λ = 470 ± 20 nm) light. Depending on the relative concentrations
of Zonyl and the light-responsive surfactant and on the duration or
intensity of light exposure, we tuned the morphology to switch between
the double-emulsion and Janus states or to invert entirely (Supplementary
Videos 3 and 4). Analogous results were achieved with Zonyl and a
pH-responsive surfactant, N-dodecylpropane-1,3-diamine, by alternating
the pH between values above and below the surfactant’s lowest
pK_a value, 4.7 (ref. 27; Fig. 3c). Similarly, cleavable surfactants show a
reduction in efficacy when irreversibly degraded by exposure to light, 
heat or changes in pH. In this context, we demonstrated irreversible tran- 
sition between the F/H/W and H/F/W double-emulsion states (Fig. 3d) 
using an acid-cleavable surfactant, sodium 2,2-bis(hexyloxy)propyl
sulphate, in conjunction with Zonyl. The results shown here demonstrate
the versatility of the complex fluorocarbon-hydrocarbon droplets’
response to a wide range of stimuli.

Liquid droplets and solid particles with asymmetric properties were 
created by effecting different chemistries in the separate compartments 
of a fluorocarbon-hydrocarbon, or more generally fluorous-organic, 
Janus droplet. To create directionally orientable and movable liquid 
Janus droplets, we synthesized magnetic Fe₃O₄ nanoparticles stabilized
with oleic acid for preferential partitioning into the organic phase. Janus
droplets with hemispheres of ethyl nonafluorobutyl ether (the fluorous
phase) and dichlorobenzene with Fe₃O₄ (the organic phase) were
rapidly oriented and moved in the direction of a magnet (Fig. 4a). To 
generate solid hemispherical particles, we polymerized an emulsion 
consisting of a liquid polymer precursor, 1,6-hexanediol diacrylate, as 
the organic phase and methoxyperfluorobutane as the fluorous phase 
(Fig. 4b). By replacing methoxyperfluorobutane with a fluorinated acry- 
late oligomer and crosslinker, we created spherical solid Janus particles 
with fluorinated and non-fluorinated sides (Fig. 4c). 

The same principles of droplet transformations observed in three- 
phase emulsions were extended to a four-phase system, thereby gen- 
erating reconfigurable droplets of even higher-order complexity. We 
designed a system composed of silicone oil (Si), hydrocarbon oil (H,
mineral oil and octadecane) and fluorinated oil (F, ethyl nonafluorobutyl 
ether) such that the liquids mixed when heated and separated into three 
phases at room temperature (20 °C). Systematically varying ratios of 
1% Zonyl and 1% SDS aqueous surfactant solutions caused droplets to 
assume morphologies combining both Janus configurations (indicated

Figure 4 | Magnetic complex emulsions, complex emulsions as templates
and four-phase emulsions. a, Janus droplets of ethyl nonafluorobutyl ether
and dichlorobenzene containing magnetite nanoparticles in the
dichlorobenzene phase are oriented with a magnet. Scale bar, 200 μm.
b, Scanning electron micrograph of hemispherical particles made from
photopolymerized Janus droplets containing hexanediol diacrylate and
methoxyperfluorobutane. Scale bar, 100 μm. c, Top: scanning electron
micrograph of a Janus particle with hydrocarbon and fluorinated polymeric
hemispheres. Bottom: the energy-dispersive X-ray spectral map reveals the
fluorinated hemisphere. Scale bar, 50 μm. d, Four-phase emulsions reconfigure


using ‘|’) and encapsulated configurations (indicated using ‘/’) (Fig. 4d). 
For example, a triple emulsion H/Si/F/W was formed in 1% Zonyl, but 
in a 3:2 ratio of 1% Zonyl:1% SDS we observed a Janus configuration 
between the fluorous and silicone phases while the hydrocarbon phase 
remained encapsulated in the silicone phase, H/Si|F/W. As the pro- 
portion of SDS increased, the droplet passed through Janus droplet and 
mixed Janus-encapsulated droplet configurations before inverting to 
the reverse triple emulsion (F/Si/H/W) in 1% SDS. 

Complex droplets of controllable composition and dynamic recon- 
figurable morphology provide a new active element for novel and existing 
applications of emulsions. The fabrication method and the dynamic 
mechanism presented are general and can be broadly applied using a 
wide variety of chemicals, materials and surfactants well beyond the 
initial demonstrations described here. Droplets triggered by environ- 
mental stimuli could be used, for example, to target the release of drugs 
at tumours, to induce changes in colour or transparency for camouflage, 
as vehicles for the sequestration of pollutants, as tunable lenses, or as 
sensors. Emulsions with the characteristic ability to selectively ‘present’ 
and ‘hide’ specific liquid interfaces and controllably alter droplet mor- 
phology and symmetry will find abundant applications. 


Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper.

METHODS

Chemicals. The following chemicals were used as received: sodium dodecyl sulphate
(≥99%), Sudan Red 7B (95%), 1,6-hexanediol diacrylate (80%), 1,4-butanediol
diacrylate (90%), trimethylolpropane ethoxylate triacrylate (M_n = 428 g mol⁻¹),
1,2-dichlorobenzene (99%), Zonyl FS-300 (40% solids), methoxyperfluorobutane
(99%), mineral oil (light), iron(II) chloride tetrahydrate (99.99%), 2,2,3,3,4,4,5,5,6,
6,7,7,8,8,9,9-hexadecafluorodecane-1,10-diol (97%), methacrylic acid (99%) and
octadecane (99%) (Sigma-Aldrich); silicone oil (for oil baths, −40 to +200 °C),
perfluorohexanes (98%), hexanes (98%), and iron(III) chloride (98%) (Alfa Aesar);
ethyl nonafluorobutyl ether (>98%) and sodium oleate (>97%) (TCI); Darocur
1173 (Ciba); fluorinated acrylate oligomer (Sartomer); Nile Red (99%) (Acros);
N-dodecylpropane-1,3-diamine (>95%) (Matrix). Light-responsive surfactant was
made by the literature procedure. Fluorinated coumarin dye was made by the
literature procedure. Cleavable surfactant sodium 2,2-bis(hexyloxy)propyl sulphate
was made by the literature procedure.

Synthesis of the fluorinated crosslinker. To synthesize the fluorinated crosslinker 
2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9-hexadecafluorodecane-1,10-diyl bis(2-methylacrylate),
a mixture of 2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9-hexadecafluorodecane-1,10-diol (10 g, 
22 mmol, 1 equiv.), methacrylic acid (20 ml, 240 mmol, 11 equiv.), 2,6-di-tert-butyl- 
4-methylphenol (200 mg, 0.9 mmol, 0.04 equiv.) and sulphuric acid (0.1 ml, 18 M) in 
toluene (250 ml) was heated to reflux with azeotropic removal of water (Dean-Stark 
trap). After three days, the mixture was cooled to room temperature and washed 
with saturated aqueous sodium bicarbonate (3 × 300 ml). The toluene was dried
with MgSO₄ and evaporated. The remaining residue/product was dissolved in
perfluorohexanes (150 ml) and filtered. The filtrate was evaporated to yield 7.81 g
of colourless liquid (13.1 mmol, 61% yield). ¹H-NMR (400 MHz, CDCl₃): δ 6.22
(p, J = 1.1 Hz, 2H), 5.70 (p, J = 1.5 Hz, 2H), 4.65 (t, J = 13.3 Hz, 4H), 1.98 (t,
J = 1.3 Hz, 6H). ¹³C-NMR (101 MHz, CDCl₃): δ 165.8, 135.0, 127.9, 117.5-107.5
(m, CF), 60.1 (t, J = 27.3 Hz), 18.1. ¹⁹F-NMR (376 MHz, CDCl₃): δ −119.3 (p,
J = 13.3 Hz, 4F), −121.8 to −122.0 (m, 8F), −123.3 (bs, 4F). LRMS (EI): calculated
for C₁₈H₁₄F₁₆O₄ [M]⁺, 598; found, 598. NMR spectra were obtained on a Bruker
Avance 400 MHz spectrometer. LRMS was acquired on an Agilent 5973N GCMS. 
Please see Extended Data Fig. 3 and Extended Data Fig. 4 for reaction scheme and 
NMR spectra. 

General fabrication of complex emulsions. The hydrocarbon and fluorocarbon 
liquids of choice were heated until miscible and emulsified. The temperature required 
varied depending on the solutions. Solutions were emulsified either in bulk by shak- 
ing or by coaxial glass capillary microfluidics and cooled to induce phase separation. 
For hexane-perfluorohexane emulsions, the emulsions were chilled on ice before 
imaging and often imaged while immersed in a cool water bath to maintain a 
temperature below 20 °C. For microfluidics, Harvard Apparatus PHD Ultra syringe 
pumps were used to inject the outer phase and inner phase using a glass capillary 
microfluidic device made from an outer square capillary (outer diameter, 1.5 mm; 
inner diameter, 1.05 mm; AIT Glass) and inner cylindrical capillary (outer diameter, 
1 mm; World Precision Instruments) pulled to a 30 μm tip using a P-1000 Micropipette
Puller (Sutter Instrument Company). The microfluidic set-up was heated
above the T_c of the inner phase solution using a heat lamp. Emulsions were then
cooled below T_c to induce phase separation. Emulsions were observed to be stable
during the time periods used (of the order of days). Longer-term stability experi- 
ments were not conducted. 

Measurement of interfacial tensions. Interfacial tension measurements were 
made using the pendant-drop method (ramé-hart Model 500 Advanced Goniometer).
Measurements on a drop were taken every 5 s until the interfacial tension
appeared to be nearing equilibrium or the droplet became unstable. The hexane-water
interfacial tension was measured to be 50 mN m⁻¹ and the perfluorohexane-water
interfacial tension was measured to be 55 mN m⁻¹.

Microscopy. Lateral confocal cross-sections of the droplets were imaged using a 
Nikon 1AR ultrafast spectral scanning confocal microscope. Scanning electron 
microscopy was conducted on gold-sputtered samples with a JEOL 6010LA scan- 
ning electron microscope. Fluorescence and bright-field images were taken with a 
Zeiss Axiovert 200 inverted microscope equipped with a Zeiss AxioCam HRc camera. 
Droplets typically orient themselves with the denser, fluorous phase downward. 
To take side-view images of the drops, emulsions were shaken to induce the drops 
to roll around while images were made with a 1 ms exposure. 

Fabrication of magnetic Janus droplets. Magnetite nanoparticles were made as 
follows: 25 ml of concentrated NH₄OH was added to an acidified solution of 1.6 g
of FeCl₃ and 1 g of FeCl₂·4H₂O in 50 ml of water at 80 °C. The magnetite nanoparticle
precipitate was collected with a magnet, washed with water and redispersed.
One gram of sodium oleate in 10 ml of water was added while stirring at room temperature.
The oily black precipitate was extracted with hexanes. The solid was collected
by evaporation of solvent, and was subsequently redispersed in dichlorobenzene.
Janus droplets were obtained by heating the nanoparticle-dichlorobenzene solution
and ethyl nonafluorobutyl ether above T_c and shaking in 0.2% SDS and 0.2% Zonyl
in a 2.5:1 ratio. The drops were oriented using a neodymium magnet.

Fabrication of light-responsive emulsions. Hexane dyed with Sudan Red 7B and
perfluorohexane in equal volumes were used as the inner phase, and a mixture of
0.1% light-sensitive surfactant and 0.1% Zonyl FS-300 in an 8:2 ratio was used as
surfactants in the aqueous phase. Slight adjustments to the surfactant concentrations
were made during imaging to tune the droplet morphological transitions to
achieve the desired outcome; for example, more Zonyl was needed to generate droplets
that transitioned from a Janus droplet to a hexane-perfluorohexane-water
double emulsion. A mercury lamp was used as the intense light source, and DAPI
and FITC filters were used to selectively allow ultraviolet (λ = 365 nm) and blue
(λ = 470 ± 20 nm) light to reach the sample while imaged on an inverted microscope.

Fabrication of reversibly responsive pH-sensitive emulsions. Hexane and perfluorohexane
in equal volumes were emulsified in a solution of 2 mM N-dodecylpropane-1,3-diamine
and 1.2% Zonyl in 0.2 M NaCl. Salt solution was used to control for
effects of ionic strength on the emulsion morphology. The pH was adjusted to
above and below pH 4.7 by addition of HCl and NaOH.

Fabrication of emulsions using an acid-cleavable surfactant. Hexane and perfluorohexane
in equal volumes were emulsified in a solution of 0.3% sodium 2,2-bis(hexyloxy)propyl
sulphate and 0.4% Zonyl in 0.1 M NaCl. The pH was adjusted
to 3 using HCl and the solution was allowed to sit undisturbed on the inverted
microscope stage while images were periodically made over an hour.

Fabrication of hemispherical particles. 1,6-Hexanediol diacrylate with 4% Darocur
1173 photo-initiator was heated with an equal volume of methoxyperfluorobutane
above T_c and emulsified. 1% SDS and 1% Zonyl in a 3:2 ratio yielded Janus droplets
which were then polymerized under a Dymax Blue Wave 200 ultraviolet lamp while
kept cold on ice.

Fabrication of fluorous-hydrocarbon Janus particles. Fluorinated acrylate oligomer,
fluorinated crosslinker, 1,4-butanediol diacrylate and trimethylolpropane
ethoxylate triacrylate were used in a volume ratio of 15:3:14:10. 5% Darocur 1173
was used as the photo-initiator. The inner phase mixture was heated above T_c, emulsified
in 1% SDS and polymerized with ultraviolet light over ice.

Fabrication of four-phase emulsions. Light mineral oil with 20 wt% octadecane
(used to reduce T_c in a mixture with the other liquids), silicone oil and ethyl nonafluorobutyl
ether were used as the inner phases in a volume ratio of 6:7:13. The
mineral oil and ethyl nonafluorobutyl ether both partition into the silicone oil such
that on phase separation the silicone oil phase is enriched with some quantity of
the two other phases. Aqueous mixtures of varying ratios of 1% Zonyl and 1% SDS
were used as the outer phase, and emulsions were formed in bulk by shaking.

Estimation of equilibrium interfacial tensions in the presence of surfactants.
Experimentally, it proved difficult to measure the interfacial tensions with the
pendant-drop method accurately for very long periods of time while maintaining
a stable drop volume because surfactant continued to adsorb at the droplet interface,
further reducing the already low interfacial tensions. Therefore, it is likely that
the final interfacial tensions, γ′, recorded for the hexane-water interface and, respectively,
the perfluorohexane-water interface for different surfactant compositions,
f_SDS, as illustrated in Extended Data Fig. 1a, are not the equilibrium interfacial tensions,
γ_eqb, for that system. In the absence of additional experimental data for long
time scales, we used theories of dynamic interfacial tension to estimate γ_eqb.
Rosen suggested that the dynamic interfacial tension of surfactants could be
accurately modelled using an empirical model of the form

$$\gamma(t) = \gamma_{eqb} + \frac{\gamma_0 - \gamma_{eqb}}{1 + (t/t^*)^n} \qquad (5)$$

where γ_0 = γ(t = 0) and t* and n are positive, empirically determined constants.
The value of n depends on the type of surfactant and the interface considered, as
well as on the concentration of the surfactant, and must therefore be independently
determined for every set of experimental conditions. The constant t* is the half-life
of the interfacial tension decay process: it is the time taken for the difference,
Δγ = γ(t) − γ_eqb, to decrease to half its initial value. The three unknown model parameters
(γ_eqb, t* and n) can be determined from a least-squares fit of equation (5) to
the experimental dynamic interfacial tension data. The ‘goodness-of-fit’
of equation (5) to the experimental data is quantified by the coefficient of
determination, or R², value. The closer this value is to 1, the better the fit.

An example of the fit of the model to the dynamic interfacial tension data is
shown in red in Extended Data Fig. 1a. The parameter results of using equation (5)
to fit the interfacial tension data for all the surfactant systems considered are presented
in Extended Data Table 1. The smallest value of R² obtained for the experimental
data across all the systems studied was 0.909, indicating very good agreement
between the predictions made using equation (5) and the experimental data generated
using the goniometer.

The resulting extracted values of γ_eqb have been used to plot the variation in the
interfacial tensions γ_H and γ_F as a function of the 0.1% SDS fraction, f_SDS, in Fig. 2
and Extended Data Fig. 1b. We note that some droplets fall just outside the designated
Janus region. Discrepancies may be due to our inability to measure and estimate
the equilibrium interfacial tensions with the 0.1 mN m⁻¹ accuracy required,
or there may be slight mixing of hexane and perfluorohexane in the droplets, which
would alter the interfacial tensions relative to those of the pure liquids.
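
The empirical model of equation (5) and the R² measure used to judge its fit can be sketched as follows. This is a minimal illustration only: the function names and all parameter values are hypothetical, and the least-squares search for γ_eqb, t* and n is left to whichever fitting routine is preferred.

```python
def rosen_tension(t, gamma_0, gamma_eqb, t_star, n):
    """Dynamic interfacial tension gamma(t) from the empirical model, equation (5)."""
    return gamma_eqb + (gamma_0 - gamma_eqb) / (1.0 + (t / t_star) ** n)


def r_squared(observed, predicted):
    """Coefficient of determination between observed and model values."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical parameters for illustration only (mN/m and seconds).
gamma_0, gamma_eqb, t_star, n = 50.0, 5.0, 30.0, 1.2
times = [0.0, 5.0, 10.0, 20.0, 40.0, 80.0, 160.0]
model = [rosen_tension(t, gamma_0, gamma_eqb, t_star, n) for t in times]

# At t = t_star the excess tension gamma(t) - gamma_eqb is half its initial value.
half_life_value = rosen_tension(t_star, gamma_0, gamma_eqb, t_star, n)
print(half_life_value - gamma_eqb, (gamma_0 - gamma_eqb) / 2.0)
```

Evaluating the model at t = t* makes the half-life interpretation of t* explicit, whatever the value of n.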

Estimation of γ_FH from the analysis of Janus droplet images. Bulk mixtures of
hexane and perfluorohexane have an upper consolute temperature, T_c, of approximately
22 °C, with a critical density of 1.14 g cm⁻³ and a volume ratio very close to
1:1 (ref. 18). The interfacial tension of the hexane-perfluorohexane interface at
temperatures close to (but below) this upper consolute temperature is close to zero,
and, consequently, it is difficult to measure accurately using conventional laboratory
techniques such as pendant-drop or du Noüy ring tensiometry. The best estimate
in the physical chemistry literature comes from the predictions of a model for
capillary fluctuations at an interface fitted to X-ray reflectivity measurements. In
ref. 23 it is proposed that the variation in the interfacial tension γ_FH (in units of
mN m⁻¹) with temperature T (in units of K) is given by


$$\gamma_{FH}(T) = 23.2\left(\frac{T_c - T}{T_c}\right)^{1.26} \qquad (6)$$


Equation (6) predicts that at a temperature of 283.15 K (10 °C), the hexane-perfluorohexane
interfacial tension should be 0.4 mN m⁻¹. We were unable to find
other experimental data in the literature to support this estimate. However, given
the equilibrium interfacial tension values and images of the resulting Janus droplets,
it was possible to also estimate γ_FH for our system.

The relationships between the shapes of Janus droplets and the ratio of the volumes
of the two constituent phases, k, and the relative magnitudes of the interfacial
tensions operating at the various interfaces of the droplet have been examined in
ref. 21. The theoretical treatment there can be used to relate the radii of curvature
of the F-H, H-W and F-W interfaces to the volume ratio and the interfacial tensions
γ_H, γ_F and γ_FH:

$$\frac{\gamma_H}{R_H} - \frac{\gamma_F}{R_F} = \frac{\gamma_{FH}}{R_{FH}} \qquad (7)$$

Given the interfacial tension values at the H-W and F-W interfaces (γ_H and γ_F,
respectively), equation (7) can be used to determine γ_FH if we can independently
calculate the three radii of curvature R_H, R_F and R_FH.

The shape of a typical hexane–perfluorohexane Janus droplet in water is illustrated
in Extended Data Fig. 2a. The droplet's interfaces are spherical arcs with different
radii of curvature (denoted by R_H, R_F and R_FH). Furthermore, the line of contact
between the two phases on the outer (water-facing) surface of the droplet is a circle
whose diameter, D, is indicated by the dotted line in Extended Data Fig. 2a, b. The
droplet in Extended Data Fig. 2a can be deconstructed into three spherical caps
of the type shown in Extended Data Fig. 2b, each with the same base (a circle of
diameter D) and respective radii R_H, R_F and R_FH. The volumes of these three
spherical caps are denoted by V_cap(R_H, D), V_cap(R_F, D) and V_cap(R_FH, D),
respectively.

From the definition of the volume ratio, k, between the hexane and perfluorohexane
phases, it follows that:

$$\frac{V_F}{V_H} = \frac{V_{cap}(R_F, D) - V_{cap}(R_{FH}, D)}{V_{cap}(R_H, D) + V_{cap}(R_{FH}, D)} \qquad (8)$$

The volume of a spherical cap, V_cap(r, d), can be calculated using the principles of
elementary three-dimensional geometry (ref. 35). Knowledge of the values of k, R_H,
R_F and D allows us to solve equation (8) for the radius of curvature of the F-H
interface, R_FH. Equation (7) can then be used to solve for the missing interfacial
tension, γ_FH.
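
The two-step solution described above can be sketched numerically. The following is only an illustrative sketch, not the authors' MATLAB code. It assumes that k = V_F/V_H, that all three caps are minor caps (cap height no larger than D/2), and that the Young-Laplace balance takes the form γ_H/R_H − γ_F/R_F = γ_FH/R_FH. All function names are hypothetical.

```python
import math


def cap_volume(r, d):
    """Volume of the minor spherical cap cut from a sphere of radius r
    by a base circle of diameter d (assumption: cap height <= d/2)."""
    a = d / 2.0
    if r < a:
        raise ValueError("sphere radius must be at least half the base diameter")
    h = r - math.sqrt(r * r - a * a)  # cap height
    return math.pi * h * h * (3.0 * r - h) / 3.0


def cap_radius_from_volume(v, d):
    """Invert cap_volume for the sphere radius by bisection on the cap height."""
    a = d / 2.0
    v_max = 2.0 * math.pi * a ** 3 / 3.0  # hemisphere, the largest minor cap
    if not 0.0 < v <= v_max:
        raise ValueError("volume is outside the minor-cap range")
    lo, hi = 0.0, a
    for _ in range(200):
        h = 0.5 * (lo + hi)
        # Same cap volume, written in terms of height h and base radius a.
        if math.pi * h * (3.0 * a * a + h * h) / 6.0 < v:
            lo = h
        else:
            hi = h
    h = 0.5 * (lo + hi)
    return (a * a + h * h) / (2.0 * h)


def solve_r_fh(k, r_h, r_f, d):
    """Solve the volume-ratio relation (assumed k = V_F / V_H) for R_FH."""
    v_h = cap_volume(r_h, d)
    v_f = cap_volume(r_f, d)
    # k * (v_h + v_fh) = v_f - v_fh  =>  v_fh = (v_f - k * v_h) / (1 + k)
    v_fh = (v_f - k * v_h) / (1.0 + k)
    return cap_radius_from_volume(v_fh, d)


def gamma_fh(gamma_h, gamma_f, r_h, r_f, r_fh):
    """Interfacial tension of the F-H interface from the assumed pressure balance."""
    return r_fh * (gamma_h / r_h - gamma_f / r_f)
```

Lengths can be in any consistent unit (for example pixels), because only ratios of radii enter the tension estimate.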

The two radii of curvature, Ry and Rp, and the diameter, D, of the circle of 
contact can be computed by analysing the images of the Janus droplets. A schematic
of the analysis process is shown in Extended Data Fig. 2c-e. A raw image of a
Janus droplet (Extended Data Fig. 2c) is first subjected to the Canny edge-detection
algorithm (ref. 36) based on the non-normalized contrast, with a Gaussian kernel radius
of 2.0 and low and high thresholds of 2.0 and 7.0, respectively. This procedure
results in the detection of the outer (water-facing) interfaces of the Janus droplet as
shown in Extended Data Fig. 2d. Arcs corresponding to the H-W and F-W interfaces
are then selected by visual inspection and fitted to circles using Taubin's
algorithm (ref. 37). These arcs and their fitted circles are depicted as blue and red circles
(with green dots for centres) in Extended Data Fig. 2e. The radii of these two circles 
correspond to the radii of curvature, Ry and Rp, of the H-W and F-W interfaces, 
respectively. The diameter, D, of the circle of contact between the hexane and per- 
fluorohexane phases is calculated as the minimum separation between the points 
lying between the arcs chosen to represent the H-W and F-W interfaces. This line 
of minimum separation is shown in Extended Data Fig. 2e. Image preprocessing 
and edge detection were performed in ImageJ v1.48, and the circle fitting and
interfacial tension calculations were performed in MATLAB.
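
The circle-fitting step can be illustrated with a short sketch. Note that this is not Taubin's algorithm: as a simpler stand-in, it uses the algebraic (Kåsa) least-squares fit, which solves a 3 × 3 linear system for the centre and radius. The function names are hypothetical, and the input is assumed to be a list of (x, y) edge pixels already selected for one interface.

```python
import math


def _det3(m):
    """Determinant of a 3 x 3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_circle_kasa(points):
    """Least-squares fit of x^2 + y^2 = 2*cx*x + 2*cy*y + c to (x, y) points.

    Returns (cx, cy, r). Stand-in for Taubin's method, which is less biased
    for short, noisy arcs.
    """
    n = float(len(points))
    sx = sy = sxx = syy = sxy = sz = sxz = syz = 0.0
    for x, y in points:
        z = x * x + y * y
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
        sz += z
        sxz += x * z
        syz += y * z
    # Normal equations for the unknowns (2*cx, 2*cy, c).
    a = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    b = [sxz, syz, sz]
    det = _det3(a)
    if abs(det) < 1e-12:
        raise ValueError("points are collinear or too few to define a circle")
    sol = []
    for col in range(3):  # Cramer's rule, one unknown per column
        m = [row[:] for row in a]
        for i in range(3):
            m[i][col] = b[i]
        sol.append(_det3(m) / det)
    cx, cy = sol[0] / 2.0, sol[1] / 2.0
    r = math.sqrt(sol[2] + cx * cx + cy * cy)
    return cx, cy, r
```

Applied separately to the edge pixels of the H-W and F-W arcs, the returned radii would play the role of R_H and R_F.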

This droplet analysis was performed for six Janus droplets, three from each of
the two conditions f_SDS = 0.4 and f_SDS = 0.6. The interfacial tensions obtained are
tabulated in Extended Data Table 2. The values of γ_FH do not vary significantly
between the two sets of surfactant systems, which is consistent with our expectation
that γ_FH is at best only weakly affected by the composition of the surfactant.
The average value of γ_FH deduced from this analysis is 1.07 ± 0.1 mN m⁻¹.
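
The quoted average can be checked directly from the six values listed in Extended Data Table 2. The sketch below assumes that the ± 0.1 figure is the sample standard deviation of the six droplets, which the text does not state explicitly.

```python
from statistics import mean, stdev

# gamma_FH values (mN/m) from Extended Data Table 2
gamma_fh_sds_04 = [1.06, 1.17, 0.99]  # f_SDS = 0.4
gamma_fh_sds_06 = [0.93, 1.18, 1.10]  # f_SDS = 0.6

all_values = gamma_fh_sds_04 + gamma_fh_sds_06
average = mean(all_values)
spread = stdev(all_values)  # assumption: sample standard deviation

print(f"{average:.2f} +/- {spread:.2f} mN/m")
print(f"group means: {mean(gamma_fh_sds_04):.2f}, {mean(gamma_fh_sds_06):.2f}")
```

The two group means are nearly identical, in line with the statement that γ_FH does not depend measurably on the surfactant composition.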

We note that accurately determining the interfacial tensions through the radii of 
curvature of the interfaces of the droplet is feasible only in the absence of fluid flow 
around the complex emulsion droplets. The presence of such a flow, even at low 
shear rates, leads to a deformation of the surface area of the droplet, causing local 
gradients in surfactant concentration and inducing Marangoni stresses (ref. 38). The
interplay between these effects ultimately determines the interfacial tensions at any point
on a complex emulsion droplet. Higher shear rates can dramatically disrupt droplet
stability, leading to the fragmentation of the complex emulsion and the release of
the encapsulated phases, a phenomenon whose onset has been examined in detail
by theoretical (ref. 39) and experimental (ref. 40) studies.


32. Haiying, L. & Zhongfan, L. A convenient synthesis of novel mercapto-ended 
azobenzene derivatives. Synth. Commun. 28, 3779-3785 (1998). 

33. Hua, X. Y. & Rosen, M. J. Dynamic surface tension of aqueous surfactant 
solutions: 1. Basic parameters. J. Colloid Interface Sci. 124, 652-659 (1988). 

34. Glantz, S. A. & Slinker, B. K. Primer of Applied Regression and Analysis of Variance 
2nd edn, 248 (McGraw-Hill, 1990). 

35. Harris, J. W. & Stöcker, H. Handbook of Mathematics and Computational Science
107 (Springer, 1998).

36. Canny, J. A computational approach to edge detection. IEEE Trans. Patt. Anal.
Mach. Intell. PAMI-8, 679-698 (1986).

37. Taubin, G. Estimation of planar curves, surfaces, and nonplanar space curves
defined by implicit equations with applications to edge and range image
segmentation. IEEE Trans. Patt. Anal. Mach. Intell. 13, 1115-1138 (1991).

38. Fischer, P. & Erni, P. Emulsion drops in external flow fields: the role of liquid
interfaces. Curr. Opin. Colloid Interface Sci. 12, 196-205 (2007).

39. Stone, H. A. & Leal, L. G. Breakup of concentric double emulsion droplets in linear
flows. J. Fluid Mech. 211, 123-156 (1990).

40. Muguet, V. et al. W/O/W multiple emulsions submitted to a linear shear flow: 
correlation between fragmentation and release. J. Colloid Interface Sci. 218, 
335-337 (1999). 


©2015 Macmillan Publishers Limited. All rights reserved 


[Plot, panel a: dynamic interfacial tension γ(t) (mN m⁻¹) versus time t (s).]


Extended Data Figure 1 | Dynamic interfacial tension data was used to
estimate the equilibrium interfacial tensions for the hexane-water and
perfluorohexane-water interfaces. a, Dynamic interfacial tension data
(in blue) was obtained from the pendant-drop method; the representative data
shown here was measured for the hexane-water interface at f_SDS = 0.9
(such that the aqueous solution contained 0.1% SDS and 0.1% Zonyl in a 9:1
ratio). The data was fitted to an empirical model (in red) to estimate the
equilibrium value of the interfacial tension γ_eqb = γ(t → ∞). Such fitting was
performed for all measured interfacial tensions and the fitted parameter
results are tabulated in Extended Data Table 1. b, The estimated equilibrium
interfacial tension values were used to plot the hexane-water (squares) and
perfluorohexane-water (circles) interfacial tensions as a function of the fraction
of 0.1% SDS, f_SDS, where the other fraction is 0.1% Zonyl. See discussion in
Methods for more details.
Extended Data Figure 2 | The geometry of a Janus droplet can be used to
estimate the interfacial tension between hydrocarbon and fluorocarbon
internal phases. a, Sketch of a Janus droplet consisting of hydrocarbon (grey)
and fluorocarbon (white) phases within an aqueous outer phase. The radii
of curvature of the H-W (R_H), F-W (R_F) and F-H (R_FH) interfaces are related
to their respective interfacial tensions through the Young–Laplace equation.
The diameter of the circle of contact between the two phases (dashed line) is
denoted as D. b, The Janus droplet is composed of three spherical caps, and
the volume, V_cap, of each constituent spherical cap is a function of the radius of
curvature of the spherical surface and the base diameter D. Here we show
the cap at the intersection of the hydrocarbon and fluorocarbon phases in
which V_cap is a function of R_FH and D. c, An exemplary image of a hexane-
perfluorohexane Janus droplet obtained at f_SDS = 0.6, which was used to
estimate R_FH and, in turn, γ_FH. d, The droplet pictured in c is subjected to edge
detection to determine the H-W and F-W interfaces. e, The resulting edges are
fitted to circles (red lines with green centres). The diameter of the circle of
contact is then computed (green line). Given the ratio of the volumes of the two
phases, we then determined the radius of curvature, R_FH, of the hexane-
perfluorohexane interface, which was subsequently used to estimate γ_FH. See
discussion in Methods for more details and Extended Data Table 2 for
estimated values of γ_FH.




[Panel a: reaction scheme (reagents H2SO4, BHT). Panel b: ¹H-NMR spectrum, horizontal axis f1 (ppm).]


Extended Data Figure 3 | Reaction scheme and ¹H-NMR of the fluorinated crosslinker. a, Reaction scheme for the synthesis of the fluorinated crosslinker.
BHT, 3,5-di-tert-butyl-4-hydroxytoluene. b, ¹H-NMR spectrum of the fluorinated crosslinker.

Extended Data Figure 4 | ¹³C-NMR and ¹⁹F-NMR of the fluorinated crosslinker. a, ¹³C-NMR spectrum of the fluorinated crosslinker. b, ¹⁹F-NMR of the
fluorinated crosslinker.
Extended Data Table 1 | Parameter values obtained by fitting an empirical model to interfacial tension measurements obtained using
pendant-drop goniometry

0.1% SDS            Hexane-water interface                  Perfluorohexane-water interface
fraction (f_SDS)    γ_eqb (mN m⁻¹)  t* (s)         R²       γ_eqb (mN m⁻¹)  t* (s)      R²
0.00                7.33 ± 0.12     156 ± 24       0.925    4.42 ± 0.14     159 ± 42    0.980
0.20                7.15 ± 0.03     337 ± 15       0.989    5.80 ± 0.04     106 ± 5     0.979
0.40                7.21 ± 0.14     315 ± 53       0.956    6.49 ± 0.05     233 ± 22    0.978
0.60                6.33 ± 0.22     2938. ± 113    0.947    6.45 ± 0.12     301 ± 54    0.982
0.90                6.39 ± 0.02     35 ± 1         0.992    7.66 ± 0.03     57 ± 1      0.994
0.95                5.78 ± 0.06     41 ± 1         0.993    7.73 ± 0.06     85 ± 2      0.991
1.00                14.04 ± 0.15    21 ± 3         0.909    19.56 ± 0.3     702 ± 50    0.967

Dynamic interfacial tension data were fitted to a model (Methods) to estimate equilibrium interfacial tension values. The parameters obtained from the fitting are presented here. The larger the value of the
characteristic time, t*, the slower the decay of the interfacial tension to its equilibrium value, γ_eqb. The value of R² is a measure of how well the model fits the experimental data. For all the data used in this study, R²
was above 0.9, signifying a good fit.
Extended Data Table 2 | Values of γ_FH were estimated from geometrical analysis of images of Janus droplets and are independent of the
composition of the surfactant added to the system, f_SDS

                    f_SDS = 0.4    f_SDS = 0.6
γ_FH (mN m⁻¹)       1.06           0.93
                    1.17           1.18
                    0.99           1.10

The values of γ_FH presented here were obtained by computing the radii of curvature of the hexane-perfluorohexane, hexane-water and perfluorohexane-water interfaces, and subsequently using the Young–Laplace
equation (Methods). Six droplets in total, three each under the conditions f_SDS = 0.4 and f_SDS = 0.6, were used to estimate a value of γ_FH = 1.07 ± 0.1 mN m⁻¹.
An extremely high-altitude plume seen at Mars’ morning terminator
The Martian limb (that is, the observed ‘edge’ of the planet) represents
a unique window into the complex atmospheric phenomena
occurring there. Clouds of ice crystals (CO₂ ice or H₂O ice) have been
observed numerous times by spacecraft and ground-based telescopes,
showing that clouds are typically layered and always confined below
an altitude of 100 kilometres; suspended dust has also been detected
at altitudes up to 60 kilometres during major dust storms. Highly
concentrated and localized patches of auroral emission controlled
by magnetic field anomalies in the crust have been observed at an
altitude of 130 kilometres. Here we report the occurrence in March
and April 2012 of two bright, extremely high-altitude plumes at the
Martian terminator (the day-night boundary) at 200 to 250 kilometres
or more above the surface, and thus well into the ionosphere and the
exosphere. They were spotted at a longitude of about 195° west, a
latitude of about −45° (at Terra Cimmeria), extended about 500 to
1,000 kilometres in both the north-south and east-west directions,
and lasted for about 10 days. The features exhibited day-to-day
variability, and were seen at the morning terminator but not at the
evening limb, which indicates rapid evolution in less than 10 hours and a
cyclic behaviour. We used photometric measurements to explore two
possible scenarios and investigate their nature. For particles reflecting
solar radiation, clouds of CO₂-ice or H₂O-ice particles with an
effective radius of 0.1 micrometres are favoured over dust. Alternatively,
the plume could arise from auroral emission, of a brightness
more than 1,000 times that of the Earth’s aurora, over a region with
a strong magnetic anomaly where aurorae have previously been
detected. Importantly, both explanations defy our current
understanding of Mars’ upper atmosphere.

On 12 March 2012, amateur astronomers reported a small protrusion 
above the morning terminator of the Martian southern hemisphere 
(Fig. 1). The protrusion became more prominent over the following days, 
and on 20 and 21 March it was captured by at least 18 observers employ- 
ing 20-40-cm telescopes at wavelengths from blue to red (~450-650 nm; 
Extended Data Figs 1-3). The feature was not detectable when it was 
passing Mars’ central meridian or when it reached the opposite afternoon 
limb. Available images and animations produced with the MARCI (Mars 
Color Imager) instrument on board the Mars Reconnaissance Orbiter 
(see Methods) do not show the feature, because such animations are
essentially mosaics formed by planet strips taken at ±2 h of 15:00 local
Martian solar time (when the limb plume is not observed). The 20–21 March
measurements show that the feature extends from mean latitudes
−38.7° ± 6.4° to −49.7° ± 4.2° (~460 km) with extremes from
the mean ranging from −32.3° to −53.9° (~900 km), and when rotating
into view, its longitude (westward reference system) spans 190.2° ± 4.2°
to 201.4° ± 6.8° (~660 km) with extremes from the mean ranging from
186° to 208.2° (~1,310 km). Thus, the feature is approximately above
the southeastern area of Terra Cimmeria. The event occurred when
the Martian solar heliocentric longitude was L_s = 85°–90° (early winter
of the southern hemisphere), and the daily mean insolation gradient at
Terra Cimmeria was large.

The plume was detected for 11 consecutive days from 12 to 23 March. 
There are however no observations of the terminator above Terra Cim- 
meria from 24 March to 1 April (when it was not observed), a fact that 
prevents us from putting an end date to the plume. Remarkably, the 
aspect of the features changed rapidly, their shapes going from double 
blob protrusions to pillars or finger-plume-like morphologies (Fig. 1). 
On 6 April a second, similar, event was observed near the same area and 
lasted until at least 16 April (Extended Data Fig. 2). Each event had an 
overall lifetime = 10 days, but changes over timescales <12 h between 
dawn and dusk were accompanied by day-to-day variability. A survey 
of full disk Martian images taken with the Hubble Space Telescope (HST) 
between 1995 and 1999, and of amateur image databases (the latter 
come from the Association of Lunar and Planetary Observers-Japan, 
and Société Astronomique de France (Commission Surfaces Planetaires); 
Methods) from Mars apparitions between 2001 and 2014 (a total of about
3,500 images), shows the occasional presence of clouds at the limb similar
to those observed by spacecraft. These clouds, however, are typically less
prominent and conspicuous than in the 2012 events. Exceptionally, a
set of HST images from 17 May 1997 shows an abnormal plume, which
is included here for photometric study.

We have measured the altitude of the plume top as projected onto 
the dark background. Only the highest spatial resolution images and 
those that allowed us to monitor the plume as it rotates into view from 
behind the terminator were considered for measurements. These were 
two series (24 images in total) from 20 and 21 March lasting 50 min and 
70 min, respectively. In Fig. 2 we show the projected altitude of the 
plume top at terminator as a function of the terminator west longitude 
at latitude −45° (increasing longitudes with time). Measured single maximum
altitudes on these two days were 200 ± 25 km for 20 March and
280 ± 25 km for 21 March. The feature’s projected altitude was measured
over time as the planet rotated. Fitting these altitudes to a second-order
polynomial resulted in maximum plume top altitudes of 185 ± 30 km
(20 March, west longitude at terminator 211.5°) and 260 ± 30 km (21
March, west longitude at terminator 214.5°). We have compared this
fit to a model of the projected altitude that would result from a feature
approaching the terminator and rotating into view for the Martian viewing
geometric conditions (see Methods and Extended Data Fig. 4). The
model fitting of the data results in a plume top altitude of 200 ± 50 km
for 20 March and 260 ± 50 km for 21 March. We propose that the
deviations between model fit and data for 21 March could originate from
horizontal structure in the plume or from time variability over <1 h. 
This analysis shows the extreme altitude of the plume top to be above 
Figure 1 | A high-altitude plume at the Martian terminator. a, Navigated
image (that is, observed planetary disk and fit to model: yellow line, disk; blue
line, equator) that shows the feature location (red circle) on March 21 03:21
(image, D. Parker). Dates and times are UTC; image in geographic orientation
(north, up; east, right). b–f, Plume images with west longitudes and latitudes at
terminator (inverted orientation to facilitate visibility) for dates as follows
(yellow line marks the terminator): b, March 20 02:45 (image W.J.); c, March 21
02:51 (W.J.); d, March 21 03:45 (J.P.); e, March 21 03:21, colour composite
(D. Parker); f, March 21 03:21, red filter (D. Parker). g, Plume relative
brightness map for March 21 03:21, red image (D. Parker). The white
double-ended vertical arrow in g indicates the estimated error for altitude
determination (±50 km or 4 pixel size), which is on average 2 s.d. of the
individual point uncertainty.


200 km, never before observed on Mars, and reaching the ionosphere 
and exosphere. 

We have used the set of unprocessed images from 20 and 21 March
to measure the plume’s average reflectivity (I/F) at the three available
wavelengths over a box of size ~200 km × 200 km in the vertical and
along-terminator directions (I, reflected radiation intensity; πF, incident
solar flux). In order to gain insight into the photometric behaviour of
Figure 2 | Plume top altitude and its rapid changes. Filled circles show the
plume’s altitude when rotating into view, as given by the west longitude at the
terminator (longitudes increasing with time) for March 20 02:02-02:51 UT,
blue, and March 21 02:40-03:50 UT, red. Error bars, s.d. for n = 5-10
measurements. Solid curves, second degree polynomial fits to these two data
sets. Dashed lines, altitudes predicted by a geometric model for a feature placed
at the terminator with top altitude 250 or 280 km (red) and 200 km (blue) at
west longitude 216°. The single data points are individual measurements on
March dates as follows (error bars, 2 s.d., n = 5): 12 (cross), 15 (rhomb),
17 (square), 19 (circle), 22 (triangle). Dashed lines show the projected altitude
of the Mars shadow (dotted area under these curves) at two latitudes.


the plume and have a comparison reference, we determined the reflec- 
tivity from 255 to 1,042 nm wavelength of a plume captured in 1997 by 
the HST Wide Field Planetary Camera (Extended Data Fig. 5). The 1997 
plume occurred (with its base on the terminator) at equatorial latitude
−2.9° ± 0.7°, west longitude 99.1° ± 4° and solar longitude L_s = 119.5°.
Its plume-like shape, horizontal extent of 10° (~590 km) and morning 
occurrence suggest that both the 1997 and 2012 features are related phe- 
nomena. The actual altitude of the 1997 HST feature cannot be precisely 
determined because of the lack of a rotation image sequence, but we 
can constrain its altitude to be between 50 km (for the case of a large 
horizontal feature illuminated by a grazing Sun, that is, close to 0° solar 
illumination) and 480 km (for the case of a thin vertical feature under 
90° solar illumination). Figure 3 shows the measured reflectivity (I/F) 
for both the 1997 and 2012 events. The spectra appear flat within the 
quoted uncertainties. 
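
The conversion from an angular extent to a distance on the surface, such as the 10° (~590 km) extent quoted for the 1997 plume near the equator, is a simple arc-length calculation. The sketch below assumes a mean Mars radius of 3,389.5 km, a value that is not given in the text, and the function name is hypothetical.

```python
import math

MARS_MEAN_RADIUS_KM = 3389.5  # assumed value; not quoted in the text


def great_circle_km(angle_deg, radius_km=MARS_MEAN_RADIUS_KM):
    """Arc length on the surface subtended by angle_deg along a great circle."""
    return math.radians(angle_deg) * radius_km


print(f"10 deg -> {great_circle_km(10.0):.0f} km")
```

Away from the equator, an extent in longitude would additionally be scaled by the cosine of the latitude.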

To interpret this phenomenon we explore two scenarios. First, we
assume that the plume is a cloud formed by particles of H₂O ice, CO₂
ice or dust. We performed forward radiative transfer modelling of the
reflected intensity and compared it with the observations. The model
considers multiple scattering and curvature effects for grazing illumination
and a slanted view (see Methods). For the optical properties, we
assumed Mie theory and wavelength-dependent indices of refraction.
As shown in Fig. 3, dust particles can be ruled out for the 1997 event and
show only marginal agreement with the 2012 data (see also Extended
Data Fig. 6). Typically, the best fits occur for CO₂ or H₂O ice particles
with effective radii of $0.1^{+0.1}_{-0.04}$ μm, which is consistent with the particle
size of mesospheric clouds observed at night. Comparably good fits
are obtained for a broad range of particle effective variances (0.1–2.0),
which means that the latter parameter cannot be constrained with the
available information. The best fit occurs for a layer of vertical thickness
100 km and a nadir optical depth τ_N > 0.5 for both the 1997 and
Figure 3 | Plume reflectivity and radiative transfer model comparison.
Spectral reflectivity of the 20-21 March 2012 event (triangles) and of the
17 May 1997 event (circles) from ground-based and HST observations,
respectively. The error bars represent the average quadratic deviation of the
measured reflectivity in the integration box. Observations are compared with
the best-fitting model for the case where the protrusion is assumed to be a cloud
formed by spherical particles: CO₂-ice (blue solid line), H₂O-ice (red dot-
dashed line) or Martian dust (green dashed line). Stars indicate the wavelength
grid used for radiative transfer computations.


2012 events, which implies number densities of 0.01 particles per cm³
(see Methods and Extended Data Figs 7 and 8). Vaporization of these
particles will produce an increase in the gas concentration by a factor
of ~1,000 for H₂O and 5% for CO₂ (see Methods).

According to a general circulation model (GCM) for conditions
specific to the observation dates, H₂O condensation at the relevant altitudes
requires either anomalously cold thermospheric temperatures (with
temperature drop >50 K) or an unusual increase in the H₂O mixing
ratio from 10⁻⁴ to complete saturation above 140 km (Fig. 4). CO₂
condensation would require an even larger temperature drop of 100 K
above 125 km. On the other hand, explaining the cloud as formed by
dust would require vigorous vertical transport up to at least 180 km above
the surface (which is not predicted by GCMs), or vigorous updrafts
due to dry convection under high insolation (dust heating follows from
the absorption of solar radiation). Thus, upward motions are more
likely to occur at noon than in the morning; all of which, together with
the photometric data, makes the dust hypothesis difficult to support.

In the second scenario, we explored whether the 2012 plume might 
be attributable to an aurora. Mars aurorae have been observed near where 
the plume occurs, a region with a large anomaly in the crustal magnetic 
field (at 175° west) that can drive the precipitation of solar wind
particles into the atmosphere. The meridional extent of the plume
(~500 km, roughly consistent with that of aurorae) and its variability could
also support this hypothesis. Mars’ ultraviolet aurora is dominated by
the Cameron bands of carbon monoxide (CO), with limb intensities of
kilorayleighs (kR; refs 7, 24); visible counterpart emissions are also
expected. However, quantitative estimates of the aurora intensity defy
such a hypothesis. An I/F value of 0.04 at 550 nm (Fig. 3) translates into
an auroral limb intensity of 3,600 megarayleighs (MR), much more than
the 1-MR nadir emission of strong terrestrial aurorae. Furthermore,
an auroral hypothesis requires an exceptional influx of energetic particles
over days, although solar activity in March 2012 was not unusually
high. Even accounting for deficiencies in the understanding of auroral
Figure 4 | Atmospheric temperature profile, and water and carbon dioxide
condensation temperatures. These are calculated for the conditions of the
2012 event (latitude −40°, longitude 200°, L_s = 90°, local time LT = 6 h)
according to a general circulation model. Atmospheric temperature is
shown by the black line, with grey profiles indicating values extending the range
by ±10° for latitude, 10° in longitude and 2 h in LT. Condensation
temperatures for water (dashed green line) and carbon dioxide (dashed red
line) are obtained from the saturation vapour pressure curves of both
compounds.


excitation, the extrapolation from ultraviolet intensities to the visible
falls short of the reported plume brightness by orders of magnitude. 
Confirmation or rejection of the auroral hypothesis is, however, feasi- 
ble merely by extended monitoring from ground or space. 


Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper. 
METHODS


Image availability. Amateur images. Available at the following databases: Associ- 
ation of Lunar and Planetary Observers-Japan, http://alpo-j.asahikawa-med.ac.jp/ 
Latest/Mars.htm (2012), and Société Astronomique de France (Commission sur- 
faces Planetaires), http://www.astrosurf.com/pellier/mars (2012). 

Hubble Space Telescope images. Available at NASA Planetary Data System (PDS) 
or at ESA HST archive, http://archives.esac.esa.int/hst/#Q-+R-S+D-1 (2012). 
MRO MARCI Weather Reports. Malin Space Science Systems. Captioned Image 
Release No. MSSS-216 - 21 March 2012 http://www.msss.com/msss_images/2012/ 
03/21/ and Captioned Image Release No. MSSS-217 - 28 March 2012 http://www. 
msss.com/msss_images/2012/03/28/. 

MOLA maps. Available at http://mola.gsfc.nasa.gov/images/topo_labeled.jpg. 
Image analysis and measurement. Most images were obtained with broadband 
filters covering the blue (B, λ_eff = 450 ± 50 nm), green (G, λ_eff = 520 ± 50 nm) and
red (R, λ_eff = 625 ± 75 nm) spectral bands. Each image was constructed by summing,
aligning and re-centring sequences of frames captured in video mode, and by
applying the ‘lucky imaging’ method and ‘wavelet’ filtering as the main processing
technique. The Airy disk for a telescope of 356 mm in diameter (Fig. 1a, b,
d–g and photometric analysis) at a wavelength of 450 nm is 0.317 arcsec, and the
plate scale employed was 0.045 arcsec per pixel, equivalent to an effective pixel size
of 25 km on Mars. This scale is typical of high-resolution planetary images, in which
recording features depends, for excellent seeing conditions, on resolution and feature
contrast. The total spatial coverage of the projected plume is ~850 pixels (21 March,
Fig. 1g). Image navigation, that is, the determination of the planetary limb and ter- 
minator was performed using two well-tested software packages (LAIA, WinJupos; 
ref. 33 and http://www.grischa-hahn.homepage.t-online.de/astro/index.htm). The 
bright Martian limb was used as a reference but not the terminator whose defini- 
tion is affected by the presence of the dark and bright features. Navigation was accu- 
rately controlled by measuring the position of well-known surface features (Olympus 
Mons and Tharsis volcanoes). The projected plume top altitude on the sky plane 
(defined as the border between the feature and the dark background, contrast forced) 
was simply determined from its distance to the Mars disk centre and subtraction of 
the measured Martian radius (all in pixel units). Navigation and image measure- 
ments were performed independently by four of us. From these multiple measure- 
ments, we derived the mean values of the Mars radius and top altitude for each image. 
The uncertainties represent the root mean square (r.m.s.) values of the altitude 
determinations. We estimate that the positioning error is ±2 pixels and that, for
the above plate scale, this corresponds to ±50 km. Uncertainties in plume
measurements result from brightness diffusion at the top and sides of the plumes and
from the determination of the base location of the plume relative to the navigated 
terminator, including its brightness irregularities. 
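
The resolution figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below assumes that the "Airy disk" value refers to the angular radius of the first dark ring, 1.22 λ/D, which the text does not state explicitly; the function names are hypothetical.

```python
import math

ARCSEC_PER_RAD = 180.0 * 3600.0 / math.pi


def airy_radius_arcsec(wavelength_m, aperture_m):
    """Angular radius of the first Airy minimum, 1.22 * lambda / D, in arcsec."""
    return 1.22 * wavelength_m / aperture_m * ARCSEC_PER_RAD


def pixels_to_km(n_pixels, km_per_pixel=25.0):
    """Convert a length in pixels to km using the quoted effective pixel size."""
    return n_pixels * km_per_pixel


airy = airy_radius_arcsec(450e-9, 0.356)      # 356 mm aperture at 450 nm
print(f"Airy radius: {airy:.3f} arcsec")
print(f"Airy radius in pixels: {airy / 0.045:.1f}")  # 0.045 arcsec per pixel
print(f"2-pixel positioning error: {pixels_to_km(2):.0f} km")
```

Under this assumption the diffraction limit spans several pixels, so the ±2 pixel positioning error is smaller than the optical resolution element.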

As a further step in navigation control, we used the image series of 20 March by 
W.J. to compare the results of the top altitude measurements by two different pro- 
cedures: (1) through measurements on each image of the planet and plume top 
radius to calculate the plume maximum altitude; (2) through determination of the 
mean radius for the whole image series (and its uncertainty) and then using this mean 
value for each image to determine the top altitude. Both approaches yield consis- 
tent results within the uncertainties indicated in the text. 

The measured top altitude values (H) and the terminator west longitudes (L, defined
as the central meridian longitude for each observing time + 70°) for 20 and 21 March
were fitted to a second-order polynomial, $H = a_0 + a_1 L + a_2 L^2$, which gave the peak
at $L_{\max} = -a_1/(2a_2)$ and the corresponding $H_{\max}$. The retrieved coefficients
are, for 20 March, a_0 = −15,233.8, a_1 = 144.7, a_2 = −0.3395, and for 21 March,
a_0 = −43,164.8, a_1 = 401.8, a_2 = −0.929.
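
The peak of the fitted parabola follows directly from the coefficients. The sketch below evaluates L_max = −a_1/(2a_2) and H_max = a_0 − a_1²/(4a_2) for the two sets of coefficients quoted above; the function name is hypothetical.

```python
def parabola_peak(a0, a1, a2):
    """Return (L_max, H_max) for H = a0 + a1*L + a2*L**2 with a2 < 0."""
    if a2 >= 0:
        raise ValueError("a2 must be negative for the parabola to have a maximum")
    l_max = -a1 / (2.0 * a2)
    h_max = a0 - a1 * a1 / (4.0 * a2)
    return l_max, h_max


# Coefficients as printed in the text (rounded values).
COEFFS = {
    "20 March": (-15233.8, 144.7, -0.3395),
    "21 March": (-43164.8, 401.8, -0.929),
}

for date, coeffs in COEFFS.items():
    l_max, h_max = parabola_peak(*coeffs)
    print(f"{date}: L_max = {l_max:.1f} deg W, H_max = {h_max:.0f} km")
```

Because H_max is a small difference between two large terms, the rounded coefficients reproduce the quoted peaks only approximately: about 185 km near 213° for 20 March and about 281 km near 216° for 21 March, which lie within the stated uncertainties of the values in the main text.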
Plume altitude model. Angular distance between the plume’s base and the central
meridian. Our coordinate system is placed at the centre of Mars: the x axis lies along
the observer’s line of sight and the y–z plane coincides with the sky plane. The coordinate
system geometry is depicted in Extended Data Fig. 4, with the cloud rising near the
west morning limb and the sub-Earth point on the equator. The longitude of the
Central Meridian (CM) is λ_CM, and the latitude and longitude of the plume are
φ_c and λ_c, respectively. The longitudinal angular distance of the plume to the CM is
related to the angular distance β between the plume and the limb, taking the
negative y axis as a reference, by:

$$90^\circ - \beta = \lambda_c - \lambda_{CM} \qquad (1)$$

where β is positive if the limb is behind the cloud, and negative if the cloud is
located behind the limb.

Limit altitude for plume visibility. Let z(B) be the visible projected plume altitude 
above Mars’ limb not blocked by the planet’s disk when / < 0. When the plume is 
behind the limb it protrudes from the planet’s limb only if its total projected length 
on the y-z plane from the planet centre is larger than the planet’s radius Ry. With 
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cy and c, being respectively the y and z coordinates of the cloud top, we have 
{e +2 \ — Ry +2(B). In terms of f and the real cloud altitude H, we get 
es cos g, cos B (2) 
¢, =(Ru + H)sing, 
The minimum height of the plume H,,in to protrude from the limb (z(f) = 0) is 
Ru 

{1— cos? sin? B} 
Real plume altitude derived from its projected length. When f < 0, by expression 

(2), z(B) must satisfy 


Amin 


1/2 Ru (3) 


z(B) =(Ru +H) {cos? g, cos? B +sin?g.}"? —Ry (4) 
and the actual plume altitude is: 
2(B) +R 
{1— cos? g,sin2p}/” 


If £ > 0, the whole plume is visible (we see its projected total height). The cloud 
height H is now given by: 


H 


Ru (5) 


2(B) 
{1— cos? g, sin? B} 
Correction when Mars has a declination Dy as observed from Earth. By taking the 
declination angle + D, when the north pole is tilted towards Earth and — Dg when 


the planet’s south pole is visible, the transformation matrix due to a rotation 
around the y axis by an angle + Dg is 


H 


3 (6) 


x! cosDg 0. sinDg x 
yl= 0 1 0 y (7) 
Zz —sinDg 0 cosDs | | z 


Therefore, expressions (5) and (6) must be corrected taking into account that: 
eee (8) 
z = —xsinDz +z cos Dg 
In the rotated coordinate system, the projected plume altitude is $z'(\beta) = \sqrt{{c'_y}^2 + {c'_z}^2} - R_M$, with c'_y and c'_z being the transformed coordinates of the plume top, with the non-transformed coordinates of the plume top given by:

$$c_x = (R_M + H)\cos\varphi_c\sin\beta, \qquad c_y = -(R_M + H)\cos\varphi_c\cos\beta, \qquad c_z = (R_M + H)\sin\varphi_c \qquad (9)$$

Expressing c'_y and c'_z in terms of c_x, c_y and c_z when the plume is behind the limb gives its real height

$$H = \frac{z'(\beta) + R_M}{\{\cos^2\varphi_c\cos^2\beta + (\sin\varphi_c\cos D_E - \cos\varphi_c\sin\beta\sin D_E)^2\}^{1/2}} - R_M \qquad (10)$$

Owing to the planet's tilt, the plume will be on the limb not for β = 0 but for a different value β_lim. Therefore, for β = β_lim, z'(β) = 0 if H = 0. Applying this condition to expression (10) and after some algebraic manipulations, we obtain:

$$\beta_{lim} = -\sin^{-1}[\tan(\varphi_c)\tan(D_E)] \qquad (11)$$

When D_E = 0 we recover β_lim = 0. Note that at the equator β_lim is always 0 regardless of the value of D_E. Expression (10) is valid for β < β_lim (the plume is behind the planet's limb). When β > β_lim, the plume is before the limb and is not blocked by the planet and we have:

$$H = \frac{z'(\beta)}{\{\cos^2\varphi_c\cos^2\beta + (\sin\varphi_c\cos D_E - \cos\varphi_c\sin\beta\sin D_E)^2\}^{1/2}} \qquad (12)$$
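
The conversion from projected to real plume altitude in expressions (10)-(12) can be sketched numerically as follows. This is a minimal illustration, not the authors' code: the function names, the value used for R_M and the sample inputs are assumptions, and setting D_E = 0 reduces the formulas to the untilted case.

```python
import math

R_MARS_KM = 3389.5  # assumed mean radius of Mars; the text only names it R_M


def projection_factor(lat_c_deg, beta_deg, d_e_deg=0.0):
    """Bracketed term of expressions (10) and (12), raised to the power 1/2."""
    phi = math.radians(lat_c_deg)
    beta = math.radians(beta_deg)
    d_e = math.radians(d_e_deg)
    term_y = math.cos(phi) * math.cos(beta)
    term_z = (math.sin(phi) * math.cos(d_e)
              - math.cos(phi) * math.sin(beta) * math.sin(d_e))
    return math.hypot(term_y, term_z)


def beta_limb_deg(lat_c_deg, d_e_deg):
    """Expression (11); raises ValueError if |tan(phi_c) tan(D_E)| > 1."""
    phi = math.radians(lat_c_deg)
    d_e = math.radians(d_e_deg)
    return -math.degrees(math.asin(math.tan(phi) * math.tan(d_e)))


def real_altitude_km(z_proj_km, lat_c_deg, beta_deg, d_e_deg=0.0,
                     radius_km=R_MARS_KM):
    """Real altitude H from the projected altitude z'(beta)."""
    factor = projection_factor(lat_c_deg, beta_deg, d_e_deg)
    if beta_deg < beta_limb_deg(lat_c_deg, d_e_deg):
        # Plume behind the limb: expression (10)
        return (z_proj_km + radius_km) / factor - radius_km
    # Plume in front of the limb: expression (12)
    return z_proj_km / factor


# Round trip with hypothetical inputs: project H = 200 km, then recover it.
h_true, lat_c, beta = 200.0, -45.0, -10.0
z_proj = (R_MARS_KM + h_true) * projection_factor(lat_c, beta) - R_MARS_KM
h_back = real_altitude_km(z_proj, lat_c, beta)
```

The round trip recovers the input altitude because expression (10) is the algebraic inverse of the projection; at β = β_lim the bracketed factor equals 1, so both branches agree there.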


Mars' shadow projection on the plume. Given the coordinates of the subsolar point on Mars (λ_s, φ_s) and the plume coordinates (λ_c, φ_c) we have both points separated by an angle:

$$\gamma = \cos^{-1}[\cos\varphi_s\cos\varphi_c\cos(\lambda_s - \lambda_c) + \sin\varphi_s\sin\varphi_c] \qquad (13)$$

This expression is also valid when λ_s and λ_c are the respective distances to the CM. In this case, λ_c = (π/2) − β when we are close to the morning limb and:

$$\gamma = \cos^{-1}[\cos\varphi_s\cos\varphi_c\sin(\lambda_s + \beta) + \sin\varphi_s\sin\varphi_c] \qquad (14)$$

Looking at Extended Data Fig. 4, the angle α = γ − (π/2) and the height of the planet's shadow projected on the plume is:

$$S_c = R_M\left[\frac{1}{\cos\alpha} - 1\right] \qquad (15)$$

For β < 0, using equations (5) and (15), the visible projected length of the planet's shadow projected on the protruding plume is:

$$S(\beta) = \frac{R_M}{\cos\alpha}\{\cos^2\varphi_c\cos^2\beta + \sin^2\varphi_c\}^{1/2} - R_M \qquad (16)$$

For β > 0 the planet's shadow projected on the plume is completely visible, since the planetary disk does not block it. The real height of the shadow will be S_c again, and according to equation (6) the projected shadow's height will be:

$$S(\beta) = S_c\{\cos^2\varphi_c\cos^2\beta + \sin^2\varphi_c\}^{1/2} = R_M\left[\frac{1}{\cos\alpha} - 1\right]\{\cos^2\varphi_c\cos^2\beta + \sin^2\varphi_c\}^{1/2} \qquad (17)$$

Again, if the planet's rotation axis is tilted by ±D_E degrees, we obtain

$$S(\beta) = \frac{R_M}{\cos\alpha}\{\cos^2\varphi_c\cos^2\beta + (\sin\varphi_c\cos D_E - \cos\varphi_c\sin\beta\sin D_E)^2\}^{1/2} - R_M \qquad (18)$$

for β < β_lim and

$$S(\beta) = S_c\{\cos^2\varphi_c\cos^2\beta + (\sin\varphi_c\cos D_E - \cos\varphi_c\sin\beta\sin D_E)^2\}^{1/2} \qquad (19)$$

for β ≥ β_lim.

The model predictions are shown in Fig. 2. A χ² calculation between the measured top altitudes, considering the individual errors, and the predictions from the geometric model for a uniform plume shows that for 20 March χ² < 4 for top altitudes of 190-195 km (best fit), but for 21 March χ²(min) ≈ 20 for top altitudes of 240-250 km. The large deviations between the model and the observations for the 21 March data rule out a uniform structure for the plume.

Altitude of the 1997 HST feature. The 17 May 1997 event was captured by HST on a single image at each specific wavelength, and thus the images do not show the plume in rotation at the limb/terminator. Therefore we can only constrain the maximum and minimum altitudes for the feature as seen projected on the terminator. The feature protrudes on the terminator by 10° (angle α in Extended Data Fig. 4). For solar zenith angle SZA = 90° (grazing illumination) the minimum altitude is given by equation (15): H_min = S_c = 53 km. On the other hand, the maximum altitude H_max will occur for a vertical feature whose projection L as seen on the terminator from Earth (phase angle φ = 35.5°) is given by $L = \sqrt{(R_M + S_c)^2 - R_M^2}$ and then H_max = L tan φ = 480 km.
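
As a rough numerical check of the shadow geometry, the separation angle of expression (13) and the shadow height S_c = R_M (1/cos α − 1) of expression (15) can be evaluated directly. The sketch below assumes a mean Mars radius of 3,389.5 km (the text does not state the value used) and the function names are illustrative. With α = 10° it gives roughly 52 km, close to the 53 km quoted above.

```python
import math

R_MARS_KM = 3389.5  # assumed mean radius of Mars; not stated in the text


def separation_angle_deg(lon_s_deg, lat_s_deg, lon_c_deg, lat_c_deg):
    """Angle gamma between the subsolar point and the plume, expression (13)."""
    lat_s = math.radians(lat_s_deg)
    lat_c = math.radians(lat_c_deg)
    dlon = math.radians(lon_s_deg - lon_c_deg)
    cos_gamma = (math.cos(lat_s) * math.cos(lat_c) * math.cos(dlon)
                 + math.sin(lat_s) * math.sin(lat_c))
    # Clamp against rounding so acos never sees a value outside [-1, 1]
    cos_gamma = max(-1.0, min(1.0, cos_gamma))
    return math.degrees(math.acos(cos_gamma))


def shadow_height_km(alpha_deg, radius_km=R_MARS_KM):
    """Height of the planet's shadow on the plume, expression (15)."""
    return radius_km * (1.0 / math.cos(math.radians(alpha_deg)) - 1.0)


# alpha = gamma - 90 degrees; the 1997 feature protrudes by alpha = 10 degrees
s_c = shadow_height_km(10.0)
```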

Photometric calibration. (1) Ground-based observations (20-21 March 2012). 
Stacked and aligned unprocessed images were used for reflectivity measurements. 
Because of their higher resolution, only the series B, G, R filtered images from W.J. 
and D. Parker were used for that purpose. The photometry from the brightest part 
of the cloud was compared to different bright and dark regions of Mars whose abso- 
lute (I/F) reflectivity is known™. The resulting I/F can be seen as an average for the 
cloud at B, G and R, and from independent measurements by the team members of 
the same images, we estimate an uncertainty of 20%. 

(2) HST calibration (17 May 1997). Hubble Space Telescope observations acquired 
in filters from the ultraviolet (F255W) to the near-infrared (F1042M) (Extended 
Data Fig. 5) were photometrically calibrated following the WFPC2 handbook 
instructions. Radiances were converted into absolute reflectivity I/F using the solar
spectrum*>*”*, The resulting I/F as a function of planetary geographical coordinates 
was confirmed against values given by other authors for selected locations of the 
planet” and with global albedo values of the planet”. The intensity calibration method 
for ground-based images was validated with a similar procedure to that used in 
HST images and comparison with the above procedure. 

Radiative transfer model. We conducted radiative transfer calculations of the 
reflectivity I/F with a backward Monte Carlo model designed for spherical-shell 
atmospheres**. The model produces as output the full Stokes vector of radiances, 
but only the first vector element (intensity) was considered for comparison with the 
observations. The model naturally accounts for curvature effects in the (nearly) 
grazing illumination/viewing geometries of both the 1997 and 2012 events. In its 
specific implementation here, the model assumes a uniform cloud of geometrical 
thickness D that enshrouds the entire planet. The observer looks upon the planet 
terminator with a line of sight that sequentially probes the full range of altitudes 
where the cloud is present. For the 1997 event, plane π_Sun (formed by the local vertical direction at the planet's terminator and the Sun illumination direction), and plane π_Obs (formed by the same local vertical direction and the observer viewing direction) are offset by an azimuth angle of 15°. The cosine of the polar angle
between the observer's viewing direction and the local vertical at the terminator is 0.514 ± 0.02, and is 0.404 ± 0.03 for the farthest point of the plume from the terminator. The phase angle (Sun-Mars-Earth) was 35.6°. Correspondingly, for the 2012 event, the prescribed azimuth angle and the cosine of the polar angle are 0° and 0.014 ± 0.03, and the phase angle (Sun-Mars-Earth) was 7.15°. Solar photons
enter the atmosphere following grazing trajectories, and penetrate half way through 
the limb before being scattered towards the observer at the terminator. To compare 
with the box-averaged I/Fs from observations, the model-calculated I/Fs were also 
averaged over the cloud geometrical thickness D from the observer’s vantage point. 
A number of photon realizations between 10* and 10° ensured accuracies typically 
better than 5%, which is much less than the measured uncertainties. The atmospheric inputs to the model include the cloud particles' optical properties and the cloud (vertical) optical thickness τ_N, in addition to the geometrical thickness of the cloud D.

In the prescription of particle sizes, we used a power law distribution described by the two moments r_eff (effective radius) and v_eff (effective variance). Particle candidates are CO2-ice, H2O-ice and dust, and we considered their wavelength-dependent refractive indices.

The particles' optical properties required in the radiative transfer calculations
are ϖ, p(θ) and σ, which stand for the particles' single scattering albedo, scattering
phase function and extinction cross-section, respectively. Each of these properties
is specified at the appropriate radiation wavelength λ. For the determination of the
optical properties we used Mie theory, which applies to scattering by spherical par- 
ticles. If the particles are non-spherical, Mie theory may bias the properties of inferred 
particles that fit best the measured I/F reflectivities. The differences in optical 
properties between spherical and non-spherical particles depend on the particles’ 
shape and size distribution, as well as on their composition and on the radiation 
wavelength. Numerical investigations have explored these differences over a limited 
range of parameter space’. Without additional information on the cloud particles, 
it is difficult to assess the consistency in the treatment with Mie theory of the radi- 
ative transfer problem. We can nevertheless tentatively quantify it by comparison 
against specific solutions from the literature. For r_eff = 0.1 μm and λ = 0.55 μm, the effective size parameter x_eff = 2πr_eff/λ ≈ 1. Focusing on the scattering phase function, which is particularly sensitive to the particles' shape, a few studies suggest that for x_eff ≈ 1 the differences between spherical and non-spherical particles are not usually more than a few tens of per cent. These differences are less than for larger values of x_eff.
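
The size parameter quoted above follows directly from its definition, x_eff = 2πr_eff/λ. A short check, with an illustrative function name:

```python
import math


def size_parameter(r_eff_um, wavelength_um):
    """Effective size parameter x_eff = 2 * pi * r_eff / lambda (same units)."""
    return 2.0 * math.pi * r_eff_um / wavelength_um


x_eff = size_parameter(0.1, 0.55)
print(round(x_eff, 2))  # 1.14
```

For a 0.1 μm particle at 0.55 μm the result is about 1.1, which is the regime of x_eff close to 1 discussed in the text.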

Overall, each radiative transfer calculation requires user-inputted values for λ, particle composition (and therefore wavelength-dependent refractive indices), r_eff, v_eff and τ_N (also dependent on wavelength). The problem of comparing reflectivities I/F from model calculations and observations is therefore multi-parametric. In the exploration of the space of parameters, we built a grid of solutions to the radiative transfer problem for D (10, 50, 100, 200 km), r_eff (0.01 to 2 μm; 20 values with varying steps between 0.01 and 0.1 μm), v_eff (0.1, 0.2, 0.5, 1 and 2; range consistent with values adopted in prior Mars investigations) and τ_N,502nm (= τ_N at 502 nm;
from 16 to 10 *; 31 values in steps following a power law). The total number of 
radiative transfer calculations for each of the 1997 and 2012 events exceeds 10°. 

For each possible spectrum within the grid of calculations with identical particle composition, r_eff, v_eff and τ_N,502nm, we calculated the average quadratic deviation of the model from the observed data weighted by the observation errors (χ²). A subset of the free parameter space with best-fitting values is shown in Extended Data Fig. 7 for the 1997 event and Extended Data Fig. 8 for the 2012 event. In the former case, dust particles do not fit the data with χ² < 1.0. While a number of sub-optimal solutions for different combinations of the free parameters can be found, the highest density of low-χ² solutions can be interpreted as the most likely combination of the free parameters. This happens for particles with effective radii of the order of 0.1 μm, depending on the particles' effective variance and composition. In this framework, there is no way to discriminate between the variances, and this also impedes making a choice between H2O and CO2 as the most likely particle type. Similarly, the χ² results for the 2012 observations show a broad range of particle sizes and optical thicknesses that provide χ² < 1.0. They include dust particles that, in any case, result in fits that are worse than for the H2O and CO2 compositions. Extended Data Fig. 6 shows how, given the narrow spectral range covered during the 2012 event, it is possible to retrieve acceptable fits for particle sizes ranging from 0.1 to 1.0 μm, provided that the rest of the free parameters are adequately tuned. Observations at shorter/longer wavelengths might potentially break this degeneracy.
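
The fit statistic described above can be sketched as an error-weighted mean squared deviation between a modelled and an observed spectrum. The normalisation (division by the number of wavelengths) is an assumption about what "average quadratic deviation" means here, and the I/F values below are hypothetical placeholders, not data from the paper.

```python
def mean_weighted_sq_deviation(model, observed, errors):
    """Average of ((model - observed) / error)^2 over all wavelengths."""
    if not model or not (len(model) == len(observed) == len(errors)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(((m - o) / e) ** 2 for m, o, e in zip(model, observed, errors))
    return total / len(model)


def best_fit(candidates, observed, errors):
    """Return (label, chi2) of the candidate spectrum with the lowest chi2."""
    scored = {label: mean_weighted_sq_deviation(spec, observed, errors)
              for label, spec in candidates.items()}
    label = min(scored, key=scored.get)
    return label, scored[label]


# Hypothetical I/F values at two wavelengths, for illustration only
observed = [0.11, 0.12]
errors = [0.01, 0.02]
candidates = {"grid point A": [0.10, 0.12], "grid point B": [0.14, 0.16]}
label, chi2 = best_fit(candidates, observed, errors)
```

Scanning every grid point this way yields maps of χ² over the free parameters, of the kind summarized in the Extended Data figures.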

All in all, the radiative transfer calculations for both the 1997 and 2012 events constrain to some extent the effective particle radius r_eff and the vertical extension of the cloud D, and give a lower limit for the normal optical thickness τ_N,502nm. Unfortunately, additional data are required to constrain other parameters, such as v_eff or the cloud particle composition. The optical thickness τ_N,502nm retrieved from the model fit translates into particle density after dividing the optical thickness by the particle cross-section at 502 nm and the cloud geometrical thickness.
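
The last step, converting optical thickness into particle number density, amounts to n = τ_N / (σ D). A minimal sketch with hypothetical input values (the cross-section and optical thickness below are placeholders, not retrieved values):

```python
def particle_number_density_cm3(tau_n, sigma_ext_cm2, thickness_km):
    """Number density n = tau / (sigma * D), with D converted from km to cm."""
    thickness_cm = thickness_km * 1.0e5
    return tau_n / (sigma_ext_cm2 * thickness_cm)


# Hypothetical inputs: tau = 0.1, sigma = 1e-10 cm^2, D = 100 km
n_cm3 = particle_number_density_cm3(0.1, 1.0e-10, 100.0)
```

With these placeholder inputs the result is about 100 particles per cm³; the point is only the unit handling, since σ in cm² and D in cm give a density in cm⁻³.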
Code availability. All radiative transport calculations were conducted with a
novel pre-conditioned backward Monte Carlo (PBMC) model”. A plane-parallel 
version of the model is freely available through: http://dx.doi.org/10.1051/0004-6361/201424042. This version contains instructions to run the model, test cases
and model outputs that can be compared to calculations tabulated in ref. 38. In the 
analysis of the Mars plume, we used a version of the PBMC suitable for spherical 
shell atmospheres. The spherical-shell model is available on request from A.G.M. 
(tonhingm@gmail.com). 

Mars General Circulation Model. As supplied in the Mars Climate Database at 
http://www-mars.lmd.jussieu.fr/. 

Cloud evaporation. H2O-cloud. Taking water-ice density as 0.93 g cm⁻³ (ref. 40), the mass of a spherical particle with radius 0.1 μm is m = 4 × 10⁻¹⁵ g; the corresponding number of molecules resulting from evaporation of this particle is mN_avo/μ ≈ 10⁸, taking N_avo = 6.023 × 10²³ particles per mol and the molecular weight of water as μ = 18 g mol⁻¹. For the retrieved cloud density of 0.01 cm⁻³, evaporation of the cloud will produce ~10⁶ particles per cm³. On the other hand, the atmospheric density is ρ = 5 × 10⁻¹⁵ g cm⁻³ at an altitude of 200 km, at latitude −40°, L_s = 90°, 210° W longitude and local time 6 h (ref. 18). The mean molecular weight of the atmosphere at this altitude is now μ̄ = 17 g mol⁻¹ because of oxygen dominance with a volume mixing ratio vmr = 0.6 and a 13% contribution from N2 and CO. The vmr of H2O at this altitude is 4 × 10⁻⁶. The number density of molecules in the atmosphere is N(H2O, atm.) = (ρ × N_avo/μ̄) × vmr_H2O ≈ 10³ particles cm⁻³. Therefore, evaporation of the water ice will produce an enrichment of water by a factor ~1,000 over normal conditions.

CO2-cloud. Taking CO2-ice density as 1.5 g cm⁻³, the mass of a spherical particle with radius 0.1 μm will be m = 6 × 10⁻¹⁵ g and the corresponding number of molecules resulting from evaporation is mN_avo/μ′ ≈ 10⁸, where μ′ = 44 g mol⁻¹ is the molecular weight of CO2. For the retrieved cloud particle density of 0.01 cm⁻³, cloud evaporation will produce ~10⁶ particles per cm³. On the other hand, from the above atmospheric conditions, and vmr CO2 = 0.09, the number density of molecules in the atmosphere is N(CO2, atm.) = (ρ × N_avo/μ̄) × vmr_CO2 ≈ 10⁷ particles cm⁻³. Therefore, CO2 ice evaporation will produce a 5% enrichment of CO2.
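
The order-of-magnitude arithmetic in the two evaporation estimates can be reproduced in a few lines. This is a sketch under stated assumptions: several exponents in the source are hard to read, so the atmospheric density (5 × 10⁻¹⁵ g cm⁻³) and the H2O mixing ratio (4 × 10⁻⁶) used here are readings chosen because they reproduce the quoted enrichment factors; the function names are illustrative.

```python
import math

N_AVO = 6.023e23  # particles per mol, as quoted in the text


def molecules_per_particle(radius_um, ice_density_g_cm3, molar_mass_g_mol):
    """Mass (g) of a spherical ice particle and the molecules it releases."""
    radius_cm = radius_um * 1.0e-4
    mass_g = (4.0 / 3.0) * math.pi * radius_cm ** 3 * ice_density_g_cm3
    return mass_g, mass_g * N_AVO / molar_mass_g_mol


def ambient_number_density(rho_atm_g_cm3, mean_molar_mass_g_mol, vmr):
    """Ambient number density (cm^-3) of a species with mixing ratio vmr."""
    return rho_atm_g_cm3 * N_AVO / mean_molar_mass_g_mol * vmr


CLOUD_DENSITY_CM3 = 0.01  # retrieved cloud particle density
RHO_ATM = 5.0e-15         # g cm^-3 at 200 km (assumed reading of the exponent)
MU_ATM = 17.0             # g mol^-1, mean molecular weight at 200 km

mass_h2o, n_h2o = molecules_per_particle(0.1, 0.93, 18.0)
mass_co2, n_co2 = molecules_per_particle(0.1, 1.5, 44.0)

# Released molecules per cm^3 relative to the ambient abundance
ratio_h2o = (CLOUD_DENSITY_CM3 * n_h2o
             / ambient_number_density(RHO_ATM, MU_ATM, 4.0e-6))  # vmr assumed
ratio_co2 = (CLOUD_DENSITY_CM3 * n_co2
             / ambient_number_density(RHO_ATM, MU_ATM, 0.09))
```

Under these readings the water ratio comes out at the 10³ level and the CO2 ratio at about 0.05, consistent with the factor ~1,000 and the 5% enrichment stated in the text.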
Extended Data Figure 1 | Images of the 2012 plume event (ringed) on 12-20 March. Dates in March (and authors) are as follows: a, 12 (M.D.); b, 15 (D. Peach); c-f, 20, plume in rotation (W.J.). Time indicated at top left of each panel is in UTC.
Extended Data Figure 2 | Images of the 2012 plume events (ringed) on 22 March and 13 April. a, First event on 22 March, 04:12 UTC (image by W.J.); b, second event on 13 April, 20:03 UTC (image by D. Peach).
Extended Data Figure 3 | Images of the 2012 plume event (ringed) at different wavelengths on 21 March. a-c, Images by J.P.; d-f, images by D. Parker, with filters indicated: a, d, B (blue); b, e, G (green); c, f, R (red). Time indicated is in UTC.
Extended Data Figure 4 | Martian viewing geometry. a, Angle definitions with the simulated protrusion of altitude H located at point c and out of the illuminated part of the disk near the limb. b, Top view, taking as a reference the planet's terminator and definition of γ and α angles when the cloud is on the equator (but the latitude of the subsolar point is not zero). Green arrows represent the projected cloud altitude as seen from Earth in the extreme situations when the cloud is on the terminator and follows the grazing sunlight, and when it follows the planet's radius. To simplify the figure, and without loss of generality, the sub-Earth point is placed on the arc linking the subsolar point and the cloud base. c, General side view of the geometry of the planet's projected shadow.
Extended Data Figure 5 | Hubble Space Telescope images of the event on May 17 1997. Wavelengths and times in UTC were: a, 255 nm (17:27); b, 410 nm (17:35); c, 502 nm (17:38); d, 588 nm (17:50); e, 673 nm (17:41); f, 1,042 nm (17:47); g, Colour composite. Plume ringed in a-f, arrowed in g. Table at bottom identifies each image and its HST number, and also shows filters used, giving their central wavelength and bandwidth (FWHM).

Identification   HST number       Filter    Wavelength (nm)   FWHM (nm)
a                u3gi7701m.fits   F255W     257.3             42.6
b                u3gi7703m.fits   F410M     408.8             18.2
c                u3gi7704m.fits   F502N     501.2             3.6
e                u3gi7705m.fits   F673N     673.2             6.3
f                u3gi7707m.fits   F1042M    1045.3            89.7
d                u3gi7708m.fits   F588N     589.3             6.4
Extended Data Figure 6 | Radiative transfer model fit for the 2012 event. This is an example of the degeneracy of the model solution due to the narrow wavelength range covered in the 2012 event. Model fit as follows: solid lines, CO2; dot-dashed lines, H2O particles; blue, data for an effective radius r_eff = 0.1 μm; red, data for r_eff = 1.0 μm; stars, wavelengths used in the calculations. As in Fig. 3, open black triangles show the observed reflectivity of the 2012 cloud. The error bars represent the average quadratic deviation of the measured reflectivity in the integration box.
Extended Data Figure 7 | Assessment of the radiative transfer model fit for the 1997 event. a-i, Colours show values of χ² (for measured I/F versus model calculation, colour scale at right) for the effective radius (r_eff in μm) versus optical depth (τ_N at 502 nm), and for different particle types and values of the indicated particle variance (v_eff, shown in parentheses) as follows. a, CO2 (0.5); b, CO2 (1.0); c, CO2 (2.0); d, DST (dust, 0.5); e, DST (dust, 1.0); f, DST (dust, 2.0); g, H2O (0.5); h, H2O (1.0); and i, H2O (2.0). The calculations are for a vertical extension of the cloud with D = 100 km, and they provide the best-fitting values of the whole free parameter space.
Extended Data Figure 8 | Assessment of the radiative transfer model fit for the 2012 event. As Extended Data Fig. 7 in terms of variables plotted, particle types
and particle variance, but for the 2012 event.


©2015 Macmillan Publishers Limited. All rights reserved 




doi:10.1038/nature14236 


Human-level control through deep reinforcement learning


Volodymyr Mnih1*, Koray Kavukcuoglu1*, David Silver1*, Andrei A. Rusu1, Joel Veness1, Marc G. Bellemare1, Alex Graves1,
Martin Riedmiller1, Andreas K. Fidjeland1, Georg Ostrovski1, Stig Petersen1, Charles Beattie1, Amir Sadik1, Ioannis Antonoglou1,
Helen King1, Dharshan Kumaran1, Daan Wierstra1, Shane Legg1 & Demis Hassabis1


The theory of reinforcement learning provides a normative account’, 
deeply rooted in psychological’ and neuroscientific’ perspectives on 
animal behaviour, of how agents may optimize their control of an 
environment. To use reinforcement learning successfully in situations 
approaching real-world complexity, however, agents are confronted 
with a difficult task: they must derive efficient representations of the 
environment from high-dimensional sensory inputs, and use these 
to generalize past experience to new situations. Remarkably, humans 
and other animals seem to solve this problem through a harmonious 
combination of reinforcement learning and hierarchical sensory pro- 
cessing systems**, the former evidenced by a wealth of neural data 
revealing notable parallels between the phasic signals emitted by dopa- 
minergic neurons and temporal difference reinforcement learning 
algorithms*. While reinforcement learning agents have achieved some 
successes in a variety of domains**, their applicability has previously 
been limited to domains in which useful features can be handcrafted, 
or to domains with fully observed, low-dimensional state spaces. 
Here we use recent advances in training deep neural networks’"' to 
develop a novel artificial agent, termed a deep Q-network, that can 
learn successful policies directly from high-dimensional sensory inputs 
using end-to-end reinforcement learning. We tested this agent on 
the challenging domain of classic Atari 2600 games’”. We demon- 
strate that the deep Q-network agent, receiving only the pixels and 
the game score as inputs, was able to surpass the performance of all 
previous algorithms and achieve a level comparable to that of a pro- 
fessional human games tester across a set of 49 games, using the same 
algorithm, network architecture and hyperparameters. This work 
bridges the divide between high-dimensional sensory inputs and 
actions, resulting in the first artificial agent that is capable of learn- 
ing to excel at a diverse array of challenging tasks. 

We set out to create a single algorithm that would be able to develop 
a wide range of competencies on a varied range of challenging tasks—a 
central goal of general artificial intelligence’* that has eluded previous 
efforts*'*"*. To achieve this, we developed a novel agent, a deep Q-network 
(DQN), which is able to combine reinforcement learning with a class 
of artificial neural network'* known as deep neural networks. Notably, 
recent advances in deep neural networks””’, in which several layers of 
nodes are used to build up progressively more abstract representations 
of the data, have made it possible for artificial neural networks to learn 
concepts such as object categories directly from raw sensory data. We 
use one particularly successful architecture, the deep convolutional 
network’, which uses hierarchical layers of tiled convolutional filters 
to mimic the effects of receptive fields—inspired by Hubel and Wiesel’s 
seminal work on feedforward processing in early visual cortex'*—thereby 
exploiting the local spatial correlations present in images, and building 
in robustness to natural transformations such as changes of viewpoint 
or scale. 

We consider tasks in which the agent interacts with an environment through a sequence of observations, actions and rewards. The goal of the agent is to select actions in a fashion that maximizes cumulative future reward. More formally, we use a deep convolutional neural network to approximate the optimal action-value function

$$Q^*(s,a) = \max_\pi \mathbb{E}\left[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots \mid s_t = s,\ a_t = a,\ \pi\right],$$

which is the maximum sum of rewards r_t discounted by γ at each time-step t, achievable by a behaviour policy π = P(a|s), after making an
observation (s) and taking an action (a) (see Methods)”. 
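
The quantity inside the expectation is a discounted sum of rewards. A minimal sketch of that sum for a single, fully observed reward sequence is given below; the reward values are hypothetical, and the true Q* additionally takes an expectation over stochastic trajectories and a maximum over policies, which this toy example does not attempt.

```python
def discounted_return(rewards, gamma):
    """Compute r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for one trajectory."""
    total = 0.0
    # Accumulate from the last reward backwards so each step is one multiply-add
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


value = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
print(value)  # 1.75
```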

Reinforcement learning is known to be unstable or even to diverge 
when a nonlinear function approximator such as a neural network is 
used to represent the action-value (also known as Q) function”. This 
instability has several causes: the correlations present in the sequence 
of observations, the fact that small updates to Q may significantly change 
the policy and therefore change the data distribution, and the correlations 
between the action-values (Q) and the target values r + γ max_a′ Q(s′, a′).
We address these instabilities with a novel variant of Q-learning, which 
uses two key ideas. First, we used a biologically inspired mechanism 
termed experience replay*’** that randomizes over the data, thereby 
removing correlations in the observation sequence and smoothing over 
changes in the data distribution (see below for details). Second, we used 
an iterative update that adjusts the action-values (Q) towards target 
values that are only periodically updated, thereby reducing correlations 
with the target. 

While other stable methods exist for training neural networks in the 
reinforcement learning setting, such as neural fitted Q-iteration™, these 
methods involve the repeated training of networks de novo on hundreds 
of iterations. Consequently, these methods, unlike our algorithm, are 
too inefficient to be used successfully with large neural networks. We 
parameterize an approximate value function Q(s,a;θ_i) using the deep
convolutional neural network shown in Fig. 1, in which θ_i are the parameters
(that is, weights) of the Q-network at iteration i. To perform
experience replay we store the agent's experiences e_t = (s_t, a_t, r_t, s_{t+1})
at each time-step t in a data set D_t = {e_1, …, e_t}. During learning, we
apply Q-learning updates, on samples (or minibatches) of experience 
(s,a,r,s’) ~ U(D), drawn uniformly at random from the pool of stored 
samples. The Q-learning update at iteration i uses the following loss 
function: 


$$L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[\left(r + \gamma \max_{a'} Q(s',a';\theta_i^-) - Q(s,a;\theta_i)\right)^2\right]$$

in which γ is the discount factor determining the agent's horizon, θ_i are the parameters of the Q-network at iteration i and θ_i⁻ are the network parameters used to compute the target at iteration i. The target network parameters θ_i⁻ are only updated with the Q-network parameters (θ_i) every C steps and are held fixed between individual updates (see Methods).
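
The interplay of uniform replay sampling, a frozen target and periodic synchronization can be illustrated with a toy sketch. To stay self-contained it replaces the convolutional network with a small table of Q-values, so it is not the authors' implementation; the discount factor, learning rate, value of C, the two-state environment and all names are assumptions made for illustration, and terminal-state handling is omitted.

```python
import random

GAMMA = 0.99   # discount factor; value assumed for illustration
N_ACTIONS = 2


def minibatch_loss(q, q_target, batch):
    """Mean squared TD error over a minibatch, plus the per-sample errors."""
    errors = []
    for s, a, r, s_next in batch:
        # The target uses the frozen parameters, not the ones being trained
        target = r + GAMMA * max(q_target[s_next])
        errors.append(target - q[s][a])
    loss = sum(e * e for e in errors) / len(errors)
    return loss, errors


def train(replay, n_states, steps, sync_every_c=4, batch_size=4, lr=0.1, seed=0):
    rng = random.Random(seed)
    q = [[0.0] * N_ACTIONS for _ in range(n_states)]
    q_target = [row[:] for row in q]
    for step in range(1, steps + 1):
        # Uniform draws (with replacement) from the pool of stored experiences
        batch = rng.choices(replay, k=batch_size)
        _, errors = minibatch_loss(q, q_target, batch)
        for (s, a, _, _), err in zip(batch, errors):
            q[s][a] += lr * err  # step that reduces the squared TD error
        if step % sync_every_c == 0:
            q_target = [row[:] for row in q]  # copy parameters every C steps
    return q


# Hypothetical experiences (s, a, r, s'): action 1 in state 0 pays 1 and leads
# to state 1, which yields no further reward.
replay = [(0, 1, 1.0, 1), (0, 0, 0.0, 0), (1, 0, 0.0, 1), (1, 1, 0.0, 1)]
q = train(replay, n_states=2, steps=500)
```

In this toy problem the value of action 1 in state 0 converges to 1, the immediate reward, because state 1 never pays anything afterwards.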

To evaluate our DQN agent, we took advantage of the Atari 2600 
platform, which offers a diverse array of tasks (n = 49) designed to be 


1Google DeepMind, 5 New Street Square, London EC4A 3TW, UK. 
Figure 1 | Schematic illustration of the convolutional neural network. The
details of the architecture are explained in the Methods. The input to the neural
network consists of an 84 × 84 × 4 image produced by the preprocessing
map φ, followed by three convolutional layers (note: snaking blue line


difficult and engaging for human players. We used the same network 
architecture, hyperparameter values (see Extended Data Table 1) and 
learning procedure throughout—taking high-dimensional data (210 x 160 
colour video at 60 Hz) as input—to demonstrate that our approach 
robustly learns successful policies over a variety of games based solely 
on sensory inputs with only very minimal prior knowledge (that is, merely 
the input data were visual images, and the number of actions available 
in each game, but not their correspondences; see Methods). Notably, 
our method was able to train large neural networks using a reinforce- 
ment learning signal and stochastic gradient descent in a stable manner— 
illustrated by the temporal evolution of two indices of learning (the 
agent’s average score-per-episode and average predicted Q-values; see 
Fig. 2 and Supplementary Discussion for details). 
Figure 2 | Training curves tracking the agent's average score and average
predicted action-value. a, Each point is the average score achieved per episode
after the agent is run with ε-greedy policy (ε = 0.05) for 520k frames on Space
Invaders. b, Average score achieved per episode for Seaquest. c, Average
predicted action-value on a held-out set of states on Space Invaders. Each point
symbolizes sliding of each filter across input image) and two fully connected
layers with a single output for each valid action. Each hidden layer is followed 
by a rectifier nonlinearity (that is, max(0,x)). 


We compared DQN with the best performing methods from the 
reinforcement learning literature on the 49 games where results were 
available'*’’. In addition to the learned agents, we also report scores for 
a professional human games tester playing under controlled conditions 
and a policy that selects actions uniformly at random (Extended Data 
Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on y 
axis; see Methods). Our DQN method outperforms the best existing 
reinforcement learning methods on 43 of the games without incorpo- 
rating any of the additional prior knowledge about Atari 2600 games 
used by other approaches (for example, refs 12, 15). Furthermore, our 
DQN agent performed at a level that was comparable to that of a pro- 
fessional human games tester across the set of 49 games, achieving more 
than 75% of the human score on more than half of the games (29 games; 
on the curve is the average of the action-value Q computed over the held-out
set of states. Note that Q-values are scaled due to clipping of rewards (see 
Methods). d, Average predicted action-value on Seaquest. See Supplementary 
Discussion for details. 
Figure 3 | Comparison of the DQN agent with the best reinforcement
learning methods’” in the literature. The performance of DQN is normalized 
with respect to a professional human games tester (that is, 100% level) and 
random play (that is, 0% level). Note that the normalized performance of DQN, 
expressed as a percentage, is calculated as: 100 × (DQN score − random play
score)/(human score − random play score). It can be seen that DQN


see Fig. 3, Supplementary Discussion and Extended Data Table 2). In 
additional simulations (see Supplementary Discussion and Extended 
Data Tables 3 and 4), we demonstrate the importance of the individual 
core components of the DQN agent—the replay memory, separate target 
Q-network and deep convolutional network architecture—by disabling 
them and demonstrating the detrimental effects on performance. 
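
The human-normalized performance used in these comparisons is 100 × (agent score − random play score)/(human score − random play score), as defined in the Fig. 3 caption. A small sketch follows; the function name and the example scores are hypothetical and are not taken from the paper's tables.

```python
def human_normalized_score(agent_score, random_score, human_score):
    """Return 100 * (agent - random) / (human - random), in per cent."""
    if human_score == random_score:
        raise ValueError("human and random scores must differ")
    return 100.0 * (agent_score - random_score) / (human_score - random_score)


# Hypothetical raw scores for one game
pct = human_normalized_score(agent_score=400.0, random_score=100.0, human_score=500.0)
print(pct)  # 75.0
```

On this scale 0% corresponds to random play and 100% to the human tester, so the 75% threshold mentioned above means closing three quarters of the gap between the two.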

We next examined the representations learned by DQN that under- 
pinned the successful performance of the agent in the context of the game 
Space Invaders (see Supplementary Video 1 for a demonstration of the 
performance of DQN), by using a technique developed for the visual- 
ization of high-dimensional data called ‘t-SNE”* (Fig. 4). As expected, 
the t-SNE algorithm tends to map the DQN representation of percep- 
tually similar states to nearby points. Interestingly, we also found instances 
in which the t-SNE algorithm generated similar embeddings for DQN 
representations of states that are close in terms of expected reward but 
outperforms competing methods (also see Extended Data Table 2) in almost all
the games, and performs at a level that is broadly comparable with or superior 
to a professional human games tester (that is, operationalized as a level of 
75% or above) in the majority of games. Audio output was disabled for both 
human players and agents. Error bars indicate s.d. across the 30 evaluation 
episodes, starting with different initial conditions. 


perceptually dissimilar (Fig. 4, bottom right, top left and middle), con- 
sistent with the notion that the network is able to learn representations 
that support adaptive behaviour from high-dimensional sensory inputs. 
Furthermore, we also show that the representations learned by DQN 
are able to generalize to data generated from policies other than its 
own—in simulations where we presented as input to the network game 
states experienced during human and agent play, recorded the repre- 
sentations of the last hidden layer, and visualized the embeddings gen- 
erated by the t-SNE algorithm (Extended Data Fig. 1 and Supplementary 
Discussion). Extended Data Fig. 2 provides an additional illustration of 
how the representations learned by DQN allow it to accurately predict 
state and action values. 

It is worth noting that the games in which DQN excels are extremely 
varied in their nature, from side-scrolling shooters (River Raid) to box- 
ing games (Boxing) and three-dimensional car-racing games (Enduro). 
Figure 4 | Two-dimensional t-SNE embedding of the representations in the
last hidden layer assigned by DQN to game states experienced while playing 
Space Invaders. The plot was generated by letting the DQN agent play for 
2h of real game time and running the t-SNE algorithm” on the last hidden layer 
representations assigned by DQN to each experienced game state. The 

points are coloured according to the state values (V, maximum expected reward 
of a state) predicted by DQN for the corresponding game states (ranging 
from dark red (highest V) to dark blue (lowest V)). The screenshots 
corresponding to a selected number of points are shown. The DQN agent 


Indeed, in certain games DQN is able to discover a relatively long-term 
strategy (for example, Breakout: the agent learns the optimal strategy, 
which is to first dig a tunnel around the side of the wall allowing the ball 
to be sent around the back to destroy a large number of blocks; see Sup- 
plementary Video 2 for illustration of development of DQN’s perfor- 
mance over the course of training). Nevertheless, games demanding more 
temporally extended planning strategies still constitute a major chal- 
lenge for all existing agents including DQN (for example, Montezuma’s 
Revenge). 

In this work, we demonstrate that a single architecture can success- 
fully learn control policies in a range of different environments with only 
very minimal prior knowledge, receiving only the pixels and the game 
score as inputs, and using the same algorithm, network architecture and 
hyperparameters on each game, privy only to the inputs a human player 
would have. In contrast to previous work**”*, our approach incorpo- 
rates ‘end-to-end’ reinforcement learning that uses reward to continu- 
ously shape representations within the convolutional network towards 
salient features of the environment that facilitate value estimation. This 
principle draws on neurobiological evidence that reward signals during 
perceptual learning may influence the characteristics of representations 
within primate visual cortex”””*. Notably, the successful integration of 
reinforcement learning with deep network architectures was critically 
dependent on our incorporation ofa replay algorithm’ involving the 
storage and representation of recently experienced transitions. Conver- 
gent evidence suggests that the hippocampus may support the physical 
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predicts high state values for both full (top right screenshots) and nearly 
complete screens (bottom left screenshots) because it has learned that 
completing a screen leads to a new screen full of enemy ships. Partially 
completed screens (bottom screenshots) are assigned lower state values because 
less immediate reward is available. The screens shown on the bottom right 
and top left and middle are less perceptually similar than the other examples but 
are still mapped to nearby representations and similar values because the 
orange bunkers do not carry great significance near the end of a level. With 
permission from Square Enix Limited. 


realization of such a process in the mammalian brain, with the time- 
compressed reactivation of recently experienced trajectories during 
offline periods*’” (for example, waking rest) providing a putative mech- 
anism by which value functions may be efficiently updated through 
interactions with the basal ganglia”’. In the future, it will be important 
to explore the potential use of biasing the content of experience replay 
towards salient events, a phenomenon that characterizes empirically 
observed hippocampal replay”’, and relates to the notion of ‘prioritized 
sweeping”? in reinforcement learning. Taken together, our work illus- 
trates the power of harnessing state-of-the-art machine learning tech- 
niques with biologically inspired mechanisms to create agents that are 
capable of learning to master a diverse array of challenging tasks. 


Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper. 
METHODS


Preprocessing. Working directly with raw Atari 2600 frames, which are 210 × 160
pixel images with a 128-colour palette, can be demanding in terms of computation 
and memory requirements. We apply a basic preprocessing step aimed at reducing 
the input dimensionality and dealing with some artefacts of the Atari 2600 emu- 
lator. First, to encode a single frame we take the maximum value for each pixel colour 
value over the frame being encoded and the previous frame. This was necessary to 
remove flickering that is present in games where some objects appear only in even 
frames while other objects appear only in odd frames, an artefact caused by the 
limited number of sprites Atari 2600 can display at once. Second, we then extract 
the Y channel, also known as luminance, from the RGB frame and rescale it to 
84 × 84. The function φ from Algorithm 1 described below applies this preprocessing
to the m most recent frames and stacks them to produce the input to the
Q-function, in which m = 4, although the algorithm is robust to different values of 
m (for example, 3 or 5). 
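
The preprocessing steps above can be sketched in a few lines of plain Python. This is a minimal illustration only, not the released code: the luminance weights (ITU-R BT.601) and the nearest-neighbour rescaling are assumptions, since the text names neither, and the function names `luminance`, `encode_frame` and `phi` are hypothetical. Frames are represented as nested lists of (R, G, B) tuples.

```python
def luminance(rgb):
    """Y (luma) of one RGB pixel. BT.601 weights are an assumption."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def encode_frame(frame, prev_frame, out_h=84, out_w=84):
    """Max over two consecutive frames, then luminance, then rescale."""
    in_h, in_w = len(frame), len(frame[0])
    # Per-channel maximum over the current and previous frame removes flicker.
    merged = [
        [tuple(max(c, p) for c, p in zip(frame[y][x], prev_frame[y][x]))
         for x in range(in_w)]
        for y in range(in_h)
    ]
    # Nearest-neighbour rescaling (assumed; the text does not name a method).
    return [
        [luminance(merged[y * in_h // out_h][x * in_w // out_w])
         for x in range(out_w)]
        for y in range(out_h)
    ]


def phi(encoded_frames, m=4):
    """Stack the m most recent encoded frames (oldest first)."""
    return list(encoded_frames)[-m:]
```

With m = 4 the stacked result corresponds to the 84 × 84 × 4 network input described in the model architecture.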

Code availability. The source code can be accessed at
https://sites.google.com/a/deepmind.com/dqn for non-commercial uses only.

Model architecture. There are several possible ways of parameterizing Q using a 
neural network. Because Q maps history-action pairs to scalar estimates of their 
Q-value, the history and the action have been used as inputs to the neural network 
by some previous approaches**”*. The main drawback of this type of architecture 
is that a separate forward pass is required to compute the Q-value of each action, 
resulting in a cost that scales linearly with the number of actions. We instead use an 
architecture in which there is a separate output unit for each possible action, and 
only the state representation is an input to the neural network. The outputs cor- 
respond to the predicted Q-values of the individual actions for the input state. The 
main advantage of this type of architecture is the ability to compute Q-values for all 
possible actions in a given state with only a single forward pass through the network. 

The exact architecture, shown schematically in Fig. 1, is as follows. The input to
the neural network consists of an 84 × 84 × 4 image produced by the preprocessing
map φ. The first hidden layer convolves 32 filters of 8 × 8 with stride 4 with the
input image and applies a rectifier nonlinearity (refs 31, 32). The second hidden layer
convolves 64 filters of 4 × 4 with stride 2, again followed by a rectifier nonlinearity.
This is followed by a third convolutional layer that convolves 64 filters of 3 × 3 with
stride 1 followed by a rectifier. The final hidden layer is fully-connected and consists
of 512 rectifier units. The output layer is a fully-connected linear layer with a
single output for each valid action. The number of valid actions varied between 4 
and 18 on the games we considered. 
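
As a quick check on the layer sizes implied by this description, the following sketch computes the output shape of each layer and the total number of weights and biases. It assumes unpadded ("valid") convolutions and one bias per filter or unit, neither of which is stated in the text; `conv_out` and `dqn_layer_shapes` are hypothetical helper names.

```python
def conv_out(size, kernel, stride):
    """Output width of an unpadded convolution (no padding is an assumption)."""
    return (size - kernel) // stride + 1


def dqn_layer_shapes(num_actions, in_size=84, in_channels=4):
    """Return per-layer output shapes and the total parameter count."""
    convs = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]  # (filters, kernel, stride)
    shapes, params = [], 0
    size, channels = in_size, in_channels
    for filters, kernel, stride in convs:
        size = conv_out(size, kernel, stride)
        params += kernel * kernel * channels * filters + filters
        channels = filters
        shapes.append((size, size, channels))
    flat = size * size * channels                     # input to the dense layer
    params += flat * 512 + 512                        # 512 rectifier units
    params += 512 * num_actions + num_actions         # linear output layer
    shapes += [(512,), (num_actions,)]
    return shapes, params
```

Under these assumptions the three convolutional layers produce 20 × 20 × 32, 9 × 9 × 64 and 7 × 7 × 64 feature maps, and most of the parameters sit in the fully-connected hidden layer.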
Training details. We performed experiments on 49 Atari 2600 games where results 
were available for all other comparable methods'*”’. A different network was trained 
on each game: the same network architecture, learning algorithm and hyperpara- 
meter settings (see Extended Data Table 1) were used across all games, showing that 
our approach is robust enough to work on a variety of games while incorporating 
only minimal prior knowledge (see below). While we evaluated our agents on unmodi- 
fied games, we made one change to the reward structure of the games during training 
only. As the scale of scores varies greatly from game to game, we clipped all positive
rewards at 1 and all negative rewards at −1, leaving 0 rewards unchanged.
Clipping the rewards in this manner limits the scale of the error derivatives and 
makes it easier to use the same learning rate across multiple games. At the same time, 
it could affect the performance of our agent since it cannot differentiate between 
rewards of different magnitude. For games where there is a life counter, the Atari 
2600 emulator also sends the number of lives left in the game, which is then used to 
mark the end of an episode during training. 
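
The reward clipping described above amounts to a one-line function. The sketch below is illustrative; the name `clip_reward` is hypothetical.

```python
def clip_reward(reward):
    """Clip positive rewards at 1 and negative rewards at -1; 0 is unchanged."""
    return max(-1.0, min(1.0, float(reward)))
```

For whole-number score changes this is equivalent to taking the sign of the reward, which is why the agent cannot distinguish, say, a reward of 10 from a reward of 400.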

In these experiments, we used the RMSProp (see
http://www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf)
algorithm with minibatches of size 32. The behaviour policy during training was
ε-greedy with ε annealed linearly
from 1.0 to 0.1 over the first million frames, and fixed at 0.1 thereafter. We trained
for a total of 50 million frames (that is, around 38 days of game experience in total) 
and used a replay memory of 1 million most recent frames. 
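
The exploration schedule can be written as a small function of the frame count. This is a sketch using the values quoted above (1.0 to 0.1 over the first million frames); the function name `epsilon_at` and its parameter names are hypothetical.

```python
def epsilon_at(frame, start=1.0, end=0.1, anneal_frames=1_000_000):
    """Linearly anneal epsilon from `start` to `end`, then hold it at `end`."""
    fraction = min(max(frame, 0) / anneal_frames, 1.0)
    return start + fraction * (end - start)
```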

Following previous approaches to playing Atari 2600 games, we also use a simple 
frame-skipping technique'*. More precisely, the agent sees and selects actions on 
every kth frame instead of every frame, and its last action is repeated on skipped 
frames. Because running the emulator forward for one step requires much less 
computation than having the agent select an action, this technique allows the agent 
to play roughly k times more games without significantly increasing the runtime. 
We use k = 4 for all games. 
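
A minimal sketch of frame skipping follows. The `step` argument is a hypothetical callable standing in for one emulator frame and returning `(reward, done)`; summing the rewards collected on skipped frames is an assumption, as the text does not say how they are combined. The second function is a piece of arithmetic: one reading under which the "around 38 days" of game experience works out is that each of the 50 million training frames is an agent step covering k = 4 emulator frames at 60 Hz. That reading is an inference, not something the text states.

```python
def repeat_action(step, action, k=4):
    """Apply `action` for up to k emulator frames and accumulate the reward."""
    total = 0.0
    for _ in range(k):
        reward, done = step(action)   # hypothetical one-frame emulator call
        total += reward
        if done:
            return total, True
    return total, False


def game_days(agent_frames, k=4, fps=60):
    """Real-time game experience, in days, for a number of agent steps."""
    return agent_frames * k / fps / 86_400
```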

The values of all the hyperparameters and optimization parameters were selected 
by performing an informal search on the games Pong, Breakout, Seaquest, Space 
Invaders and Beam Rider. We did not perform a systematic grid search owing to 
the high computational cost. These parameters were then held fixed across all other 
games. The values and descriptions of all hyperparameters are provided in Extended 
Data Table 1. 


Our experimental setup amounts to using the following minimal prior knowledge:
that the input data consisted of visual images (motivating our use of a
convolutional deep network), the game-specific score (with no modification), number
of actions, although not their correspondences (for example, specification of the 
up ‘button’) and the life count. 
Evaluation procedure. The trained agents were evaluated by playing each game 
30 times for up to 5 min each time with different initial random conditions (‘no-op’;
see Extended Data Table 1) and an ε-greedy policy with ε = 0.05. This procedure
is adopted to minimize the possibility of overfitting during evaluation. The
random agent served as a baseline comparison and chose a random action at 10 Hz 
which is every sixth frame, repeating its last action on intervening frames. 10 Hz is 
about the fastest that a human player can select the ‘fire’ button, and setting the 
random agent to this frequency avoids spurious baseline scores in a handful of the 
games. We did also assess the performance of a random agent that selected an action
at 60 Hz (that is, every frame). This had a minimal effect: changing the normalized 
DQN performance by more than 5% in only six games (Boxing, Breakout, Crazy 
Climber, Demon Attack, Krull and Robotank), and in all these games DQN out- 
performed the expert human by a considerable margin. 

The professional human tester used the same emulator engine as the agents, and
played under controlled conditions. The human tester was not allowed to pause,
save or reload games. As in the original Atari 2600 environment, the emulator was 
run at 60 Hz and the audio output was disabled: as such, the sensory input was 
equated between human player and agents. The human performance is the average 
reward achieved from around 20 episodes of each game lasting a maximum of 5 min 
each, following around 2 h of practice playing each game.
Algorithm. We consider tasks in which an agent interacts with an environment, 
in this case the Atari emulator, in a sequence of actions, observations and rewards. 
At each time-step the agent selects an action $a_t$ from the set of legal game actions,
$\mathcal{A} = \{1, \ldots, K\}$. The action is passed to the emulator and modifies its internal state
and the game score. In general the environment may be stochastic. The emulator’s
internal state is not observed by the agent; instead the agent observes an image
$x_t \in \mathbb{R}^d$ from the emulator, which is a vector of pixel values representing the current
screen. In addition it receives a reward $r_t$ representing the change in game score.
Note that in general the game score may depend on the whole previous sequence of 
actions and observations; feedback about an action may only be received after many 
thousands of time-steps have elapsed. 

Because the agent only observes the current screen, the task is partially observed (ref. 33)
and many emulator states are perceptually aliased (that is, it is impossible to fully
understand the current situation from only the current screen $x_t$). Therefore,
sequences of actions and observations, $s_t = x_1, a_1, x_2, \ldots, a_{t-1}, x_t$, are input to the
algorithm, which then learns game strategies depending upon these sequences. All
sequences in the emulator are assumed to terminate in a finite number of time-steps.
This formalism gives rise to a large but finite Markov decision process (MDP)
in which each sequence is a distinct state. As a result, we can apply standard
reinforcement learning methods for MDPs, simply by using the complete sequence $s_t$
as the state representation at time $t$.

The goal of the agent is to interact with the emulator by selecting actions in a way 
that maximizes future rewards. We make the standard assumption that future rewards 
are discounted by a factor of $\gamma$ per time-step ($\gamma$ was set to 0.99 throughout), and
define the future discounted return at time $t$ as $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$, in which $T$ is the
time-step at which the game terminates. We define the optimal action-value
function $Q^*(s,a)$ as the maximum expected return achievable by following any
policy, after seeing some sequence $s$ and then taking some action $a$,
$Q^*(s,a) = \max_\pi \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$, in which $\pi$ is a policy mapping sequences to actions (or
distributions over actions).
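
For a concrete reward sequence, the future discounted return defined above can be computed by accumulating from the final time-step backwards. This is a small illustrative sketch; `discounted_return` is a hypothetical name.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum over t' of gamma**(t' - t) * r_t', with t the first index."""
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```

For example, a single reward of 1 received two steps in the future contributes 0.99² = 0.9801 to the return at the current step.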

The optimal action-value function obeys an important identity known as the 
Bellman equation. This is based on the following intuition: if the optimal value 
$Q^*(s',a')$ of the sequence $s'$ at the next time-step was known for all possible actions
$a'$, then the optimal strategy is to select the action $a'$ maximizing the expected value
of $r + \gamma Q^*(s',a')$:

$$Q^*(s,a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^*(s',a') \mid s,a \right]$$


The basic idea behind many reinforcement learning algorithms is to estimate 
the action-value function by using the Bellman equation as an iterative update, 
$Q_{i+1}(s,a) = \mathbb{E}_{s'}[r + \gamma \max_{a'} Q_i(s',a') \mid s,a]$. Such value iteration algorithms converge
to the optimal action-value function, $Q_i \to Q^*$ as $i \to \infty$. This basic approach
is impractical, because the action-value function is estimated separately for each
sequence, without any generalization. Instead, it is common to use a function
approximator to estimate the action-value function, $Q(s,a;\theta) \approx Q^*(s,a)$. In the
reinforcement learning community this is typically a linear function approximator, but
sometimes a nonlinear function approximator is used instead, such as a neural
network. We refer to a neural network function approximator with weights $\theta$ as a
Q-network. A Q-network can be trained by adjusting the parameters $\theta_i$ at iteration
$i$ to reduce the mean-squared error in the Bellman equation, where the optimal
target values $r + \gamma \max_{a'} Q^*(s',a')$ are substituted with approximate target values
$y = r + \gamma \max_{a'} Q(s',a';\theta_i^-)$, using parameters $\theta_i^-$ from some previous iteration.
This leads to a sequence of loss functions $L_i(\theta_i)$ that changes at each iteration $i$,

$$L_i(\theta_i) = \mathbb{E}_{s,a,r}\left[ \left( \mathbb{E}_{s'}[y \mid s,a] - Q(s,a;\theta_i) \right)^2 \right]
= \mathbb{E}_{s,a,r,s'}\left[ \left( y - Q(s,a;\theta_i) \right)^2 \right] - \mathbb{E}_{s,a,r}\left[ \mathbb{V}_{s'}[y] \right].$$


Note that the targets depend on the network weights; this is in contrast with the 
targets used for supervised learning, which are fixed before learning begins. At 
each stage of optimization, we hold the parameters from the previous iteration $\theta_i^-$
fixed when optimizing the $i$th loss function $L_i(\theta_i)$, resulting in a sequence of
well-defined optimization problems. The final term is the variance of the targets, which
does not depend on the parameters $\theta_i$ that we are currently optimizing, and may
therefore be ignored. Differentiating the loss function with respect to the weights
we arrive at the following gradient:

$$\nabla_{\theta_i} L(\theta_i) = \mathbb{E}_{s,a,r,s'}\left[ \left( r + \gamma \max_{a'} Q(s',a';\theta_i^-) - Q(s,a;\theta_i) \right) \nabla_{\theta_i} Q(s,a;\theta_i) \right].$$
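
The step from the loss to this gradient can be spelled out as follows. This is a reader's sketch of the algebra rather than a derivation from the original text: it uses the standard identity that a mean-squared error splits into a squared bias and a variance, and it treats the target $y$ as fixed (parameters $\theta_i^-$ held constant). Differentiating the squared error literally produces a constant factor of $-2$; the expression in the text can be read as the descent direction with that constant absorbed into the step size.

```latex
\begin{align*}
\mathbb{E}_{s'}\!\left[(y - Q(s,a;\theta_i))^2\right]
  &= \left(\mathbb{E}_{s'}[y \mid s,a] - Q(s,a;\theta_i)\right)^2 + \mathbb{V}_{s'}[y] \\
\Rightarrow\quad
L_i(\theta_i)
  &= \mathbb{E}_{s,a,r,s'}\!\left[(y - Q(s,a;\theta_i))^2\right]
     - \mathbb{E}_{s,a,r}\!\left[\mathbb{V}_{s'}[y]\right] \\
\nabla_{\theta_i} L_i(\theta_i)
  &= -2\,\mathbb{E}_{s,a,r,s'}\!\left[(y - Q(s,a;\theta_i))\,
     \nabla_{\theta_i} Q(s,a;\theta_i)\right],
  \qquad y = r + \gamma \max_{a'} Q(s',a';\theta_i^-)
\end{align*}
```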


Rather than computing the full expectations in the above gradient, it is often 
computationally expedient to optimize the loss function by stochastic gradient 
descent. The familiar Q-learning algorithm’? can be recovered in this framework 
by updating the weights after every time step, replacing the expectations using 
single samples, and setting $\theta_i^- = \theta_{i-1}$.
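
To make the single-sample case concrete, the sketch below performs one stochastic Q-learning update with the target computed from the current weights. A linear approximator with one weight vector per action is used purely for illustration (it is not the network described in this paper), and `q_value` and `q_learning_step` are hypothetical names.

```python
def q_value(theta, features, action):
    """Linear Q(s, a; theta): one weight vector per action (illustrative)."""
    return sum(w * f for w, f in zip(theta[action], features))


def q_learning_step(theta, s, a, r, s_next, done, gamma=0.99, lr=0.1):
    """One update from a single sample, using the current weights as target."""
    target = r
    if not done:
        target += gamma * max(q_value(theta, s_next, b) for b in range(len(theta)))
    td_error = target - q_value(theta, s, a)
    # For a linear Q, the gradient with respect to theta[a] is the feature vector.
    theta[a] = [w + lr * td_error * f for w, f in zip(theta[a], s)]
    return td_error
```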

Note that this algorithm is model-free: it solves the reinforcement learning task
directly using samples from the emulator, without explicitly estimating the reward
and transition dynamics $P(r,s' \mid s,a)$. It is also off-policy: it learns about the greedy
policy $a = \arg\max_{a'} Q(s,a';\theta)$, while following a behaviour distribution that ensures
adequate exploration of the state space. In practice, the behaviour distribution is
often selected by an ε-greedy policy that follows the greedy policy with probability
$1 - \varepsilon$ and selects a random action with probability $\varepsilon$.
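
An ε-greedy behaviour policy of this kind can be sketched as below. The function takes a list of action values for the current state; the name `epsilon_greedy` and the optional `rng` argument are hypothetical conveniences for the example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy action."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)
```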
Training algorithm for deep Q-networks. The full algorithm for training deep 
Q-networks is presented in Algorithm 1. The agent selects and executes actions 
according to an ε-greedy policy based on Q. Because using histories of arbitrary
length as inputs to a neural network can be difficult, our Q-function instead works
on a fixed length representation of histories produced by the function φ described
above. The algorithm modifies standard online Q-learning in two ways to make it 
suitable for training large neural networks without diverging. 

First, we use a technique known as experience replay~** in which we store the 
agent’s experiences at each time-step, $e_t = (s_t, a_t, r_t, s_{t+1})$, in a data set $D_t = \{e_1, \ldots, e_t\}$,
pooled over many episodes (where the end of an episode occurs when a terminal
state is reached) into a replay memory. During the inner loop of the algorithm,
we apply Q-learning updates, or minibatch updates, to samples of experience,
$(s,a,r,s') \sim U(D)$, drawn at random from the pool of stored samples. This approach
has several advantages over standard online Q-learning. First, each step of experience 
is potentially used in many weight updates, which allows for greater data efficiency. 
Second, learning directly from consecutive samples is inefficient, owing to the strong 
correlations between the samples; randomizing the samples breaks these correla- 
tions and therefore reduces the variance of the updates. Third, when learning on- 
policy the current parameters determine the next data sample that the parameters 
are trained on. For example, if the maximizing action is to move left then the train- 
ing samples will be dominated by samples from the left-hand side; if the maximiz- 
ing action then switches to the right then the training distribution will also switch. 
It is easy to see how unwanted feedback loops may arise and the parameters could get 
stuck in a poor local minimum, or even diverge catastrophically”®. By using experience 
replay the behaviour distribution is averaged over many of its previous states,
smoothing out learning and avoiding oscillations or divergence in the parameters. 
Note that when learning by experience replay, it is necessary to learn off-policy 
(because our current parameters are different to those used to generate the sam- 
ple), which motivates the choice of Q-learning. 

In practice, our algorithm only stores the last N experience tuples in the replay 
memory, and samples uniformly at random from D when performing updates. This 
approach is in some respects limited because the memory buffer does not differ- 
entiate important transitions and always overwrites with recent transitions owing 
to the finite memory size N. Similarly, the uniform sampling gives equal impor- 
tance to all transitions in the replay memory. A more sophisticated sampling strat- 
egy might emphasize transitions from which we can learn the most, similar to 
prioritized sweeping”. 
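
A replay memory with the behaviour described here (keep only the last N experience tuples, sample uniformly at random) can be sketched with a bounded double-ended queue. The class name `ReplayMemory` is hypothetical, and the extra `done` flag stored with each transition is an assumption added so that terminal transitions can be recognized when targets are computed.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops the oldest item when full.
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored transitions.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Because old transitions are overwritten purely by age and every stored transition is equally likely to be drawn, this sketch shares the two limitations noted above.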

The second modification to online Q-learning aimed at further improving the 
stability of our method with neural networks is to use a separate network for generating
the targets $y_j$ in the Q-learning update. More precisely, every $C$ updates we
clone the network $Q$ to obtain a target network $\hat{Q}$ and use $\hat{Q}$ for generating the
Q-learning targets $y_j$ for the following $C$ updates to $Q$. This modification makes the
algorithm more stable compared to standard online Q-learning, where an update
that increases $Q(s_t,a_t)$ often also increases $Q(s_{t+1},a)$ for all $a$ and hence also increases
the target $y_j$, possibly leading to oscillations or divergence of the policy. Generating
the targets using an older set of parameters adds a delay between the time an update
to $Q$ is made and the time the update affects the targets $y_j$, making divergence or
oscillations much more unlikely. 
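
The two ingredients of this modification, targets computed from a frozen copy and a periodic clone, can be sketched as follows. Here `target_q` is a hypothetical callable returning the list of action values for a state under the frozen parameters, transitions carry an assumed `done` flag, and parameters are represented by any deep-copyable Python object.

```python
import copy


def td_targets(batch, target_q, gamma=0.99):
    """Targets y_j computed from the frozen target network."""
    return [
        r if done else r + gamma * max(target_q(s_next))
        for (_s, _a, r, s_next, done) in batch
    ]


def maybe_sync(update_count, online_params, target_params, C=10_000):
    """Every C updates, replace the target parameters with a fresh copy."""
    if update_count % C == 0:
        return copy.deepcopy(online_params)
    return target_params
```

Between clones, changes to the online parameters leave the targets untouched, which is the delay the paragraph above refers to.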

We also found it helpful to clip the error term from the update
$r + \gamma \max_{a'} Q(s',a';\theta_i^-) - Q(s,a;\theta_i)$ to be between −1 and 1. Because the absolute value loss
function $|x|$ has a derivative of −1 for all negative values of $x$ and a derivative of 1
for all positive values of $x$, clipping the error term of the squared-error loss to be
between −1 and 1 corresponds to using an absolute value loss function for errors outside of the (−1,1)
interval. This form of error clipping further improved the stability of the algorithm. 
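
The equivalence described here can be checked numerically. The sketch below defines the clipped error and a loss that is quadratic inside [−1, 1] and linear outside (commonly known as the Huber loss with threshold 1, a name the text itself does not use); the derivative of that loss with respect to the error equals the clipped error. Function names are hypothetical.

```python
def clipped_error(td_error):
    """Error term clipped to the interval [-1, 1]."""
    return max(-1.0, min(1.0, td_error))


def huber_loss(td_error):
    """0.5 * x**2 for |x| <= 1, and |x| - 0.5 otherwise."""
    x = abs(td_error)
    return 0.5 * x * x if x <= 1.0 else x - 0.5
```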
Algorithm 1: deep Q-learning with experience replay.

Initialize replay memory $D$ to capacity $N$
Initialize action-value function $Q$ with random weights $\theta$
Initialize target action-value function $\hat{Q}$ with weights $\theta^- = \theta$
For episode = 1, M do
    Initialize sequence $s_1 = \{x_1\}$ and preprocessed sequence $\phi_1 = \phi(s_1)$
    For t = 1, T do
        With probability $\varepsilon$ select a random action $a_t$
        otherwise select $a_t = \arg\max_a Q(\phi(s_t), a; \theta)$
        Execute action $a_t$ in emulator and observe reward $r_t$ and image $x_{t+1}$
        Set $s_{t+1} = s_t, a_t, x_{t+1}$ and preprocess $\phi_{t+1} = \phi(s_{t+1})$
        Store transition $(\phi_t, a_t, r_t, \phi_{t+1})$ in $D$
        Sample random minibatch of transitions $(\phi_j, a_j, r_j, \phi_{j+1})$ from $D$
        Set $y_j = r_j$ if episode terminates at step $j+1$,
            otherwise $y_j = r_j + \gamma \max_{a'} \hat{Q}(\phi_{j+1}, a'; \theta^-)$
        Perform a gradient descent step on $(y_j - Q(\phi_j, a_j; \theta))^2$ with respect to the
        network parameters $\theta$
        Every $C$ steps reset $\hat{Q} = Q$
    End For
End For
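
The control flow of Algorithm 1 can be exercised end to end on a toy problem. The sketch below is not the authors' implementation: a lookup table stands in for the Q-network (so the "gradient step" is a plain tabular update), the environment is a hypothetical four-state chain in which moving right from state 2 ends the episode with reward 1, and all hyperparameter values are arbitrary choices for the example.

```python
import random
from collections import deque


def run_dqn_sketch(episodes=200, N=500, C=20, batch=8, gamma=0.9, lr=0.1, seed=0):
    rng = random.Random(seed)
    goal = 3                                     # toy chain 0..3; state 3 is terminal
    q = {(s, a): 0.0 for s in range(goal + 1) for a in (0, 1)}  # stands in for Q
    q_hat = dict(q)                              # target copy of Q
    memory = deque(maxlen=N)                     # replay memory D with capacity N
    step = 0
    for episode in range(episodes):
        s = 0
        epsilon = max(0.1, 1.0 - 2.0 * episode / episodes)  # assumed schedule
        for _ in range(50):                      # at most T = 50 steps per episode
            if rng.random() < epsilon:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda b: q[(s, b)])
            s_next = min(s + 1, goal) if a == 1 else max(s - 1, 0)
            done = s_next == goal
            r = 1.0 if done else 0.0
            memory.append((s, a, r, s_next, done))
            # Minibatch update from uniformly sampled stored transitions.
            for sj, aj, rj, sj1, dj in rng.sample(list(memory), min(batch, len(memory))):
                y = rj if dj else rj + gamma * max(q_hat[(sj1, 0)], q_hat[(sj1, 1)])
                q[(sj, aj)] += lr * (y - q[(sj, aj)])
            step += 1
            if step % C == 0:                    # every C steps reset the target copy
                q_hat = dict(q)
            s = s_next
            if done:
                break
    return q, memory
```

Replacing the table with a neural network and the toy chain with the emulator recovers the structure of the algorithm, although none of the stabilizing details discussed above (error clipping, RMSProp, frame skipping) are represented here.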


31. Jarrett, K., Kavukcuoglu, K., Ranzato, M. A. & LeCun, Y. What is the best multi-stage
architecture for object recognition? Proc. IEEE Int. Conf. Comput. Vis. 2146-2153
(2009).

32. Nair, V. & Hinton, G. E. Rectified linear units improve restricted Boltzmann 
machines. Proc. Int. Conf. Mach. Learn. 807-814 (2010). 

33. Kaelbling, L. P., Littman, M. L. & Cassandra, A. R. Planning and acting in partially
observable stochastic domains. Artificial Intelligence 101, 99-134 (1998).
Extended Data Figure 1 | Two-dimensional t-SNE embedding of the
representations in the last hidden layer assigned by DQN to game states 
experienced during a combination of human and agent play in Space 
Invaders. The plot was generated by running the t-SNE algorithm” on the last 
hidden layer representation assigned by DQN to game states experienced 
during a combination of human (30 min) and agent (2 h) play. The fact that
there is similar structure in the two-dimensional embeddings corresponding to
the DQN representation of states experienced during human play (orange
points) and DQN play (blue points) suggests that the representations learned
by DQN do indeed generalize to data generated from policies other than its 
own. The presence in the t-SNE embedding of overlapping clusters of points 
corresponding to the network representation of states experienced during 
human and agent play shows that the DQN agent also follows sequences of 
states similar to those found in human play. Screenshots corresponding to 
selected states are shown (human: orange border; DQN: blue border). 
Extended Data Figure 2 | Visualization of learned value functions on two
games, Breakout and Pong. a, A visualization of the learned value function on 
the game Breakout. At time points 1 and 2, the state value is predicted to be ~17 
and the agent is clearing the bricks at the lowest level. Each of the peaks in 
the value function curve corresponds to a reward obtained by clearing a brick. 
At time point 3, the agent is about to break through to the top level of bricks and 
the value increases to ~21 in anticipation of breaking out and clearing a 
large set of bricks. At point 4, the value is above 23 and the agent has broken 
through. After this point, the ball will bounce at the upper part of the bricks 
clearing many of them by itself. b, A visualization of the learned action-value 
function on the game Pong. At time point 1, the ball is moving towards the 
paddle controlled by the agent on the right side of the screen and the values of 
all actions are around 0.7, reflecting the expected value of this state based on
previous experience. At time point 2, the agent starts moving the paddle
towards the ball and the value of the ‘up’ action stays high while the value of the
‘down’ action falls to −0.9. This reflects the fact that pressing ‘down’ would lead
to the agent losing the ball and incurring a reward of −1. At time point 3,
the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing 
until time point 4, when the ball reaches the left edge of the screen and the value 
of all actions reflects that the agent is about to receive a reward of 1. Note, 
the dashed line shows the past trajectory of the ball purely for illustrative 
purposes (that is, not shown during the game). With permission from Atari 
Interactive, Inc. 
Extended Data Table 1 | List of hyperparameters and their values

Hyperparameter                    Value     Description
minibatch size                    32        Number of training cases over which each stochastic gradient descent (SGD) update is computed.
replay memory size                1000000   SGD updates are sampled from this number of most recent frames.
agent history length              4         The number of most recent frames experienced by the agent that are given as input to the Q network.
target network update frequency   10000     The frequency (measured in the number of parameter updates) with which the target network is updated (this corresponds to the parameter C from Algorithm 1).
discount factor                   0.99      Discount factor gamma used in the Q-learning update.
action repeat                     4         Repeat each action selected by the agent this many times. Using a value of 4 results in the agent seeing only every 4th input frame.
update frequency                  4         The number of actions selected by the agent between successive SGD updates. Using a value of 4 results in the agent selecting 4 actions between each pair of successive updates.
learning rate                     0.00025   The learning rate used by RMSProp.
gradient momentum                 0.95      Gradient momentum used by RMSProp.
squared gradient momentum         0.95      Squared gradient (denominator) momentum used by RMSProp.
min squared gradient              0.01      Constant added to the squared gradient in the denominator of the RMSProp update.
initial exploration               1         Initial value of ε in ε-greedy exploration.
final exploration                 0.1       Final value of ε in ε-greedy exploration.
final exploration frame           1000000   The number of frames over which the initial value of ε is linearly annealed to its final value.
replay start size                 50000     A uniform random policy is run for this number of frames before learning starts and the resulting experience is used to populate the replay memory.
no-op max                         30        Maximum number of “do nothing” actions to be performed by the agent at the start of an episode.


The values of all the hyperparameters were selected by performing an informal search on the games Pong, Breakout, Seaquest, Space Invaders and Beam Rider. We did not perform a systematic grid search owing 
to the high computational cost, although it is conceivable that even better results could be obtained by systematically tuning the hyperparameter values. 
Extended Data Table 2 | Comparison of games scores obtained by DQN agents with methods from the literature’*?> and a professional
human games tester 


Game                 Random Play  Best Linear Learner  Contingency (SARSA)  Human   DQN (± s.d.)      Normalized DQN (% Human)
Alien                227.8        939.2                103.2                6875    3069 (±1093)      42.7%
Amidar               5.8          103.4                183.6                1676    739.5 (±3024)     43.9%
Assault              222.4        628                  537                  1496    3359 (±775)       246.2%
Asterix              210          987.3                1332                 8503    6012 (±1744)      70.0%
Asteroids            719.1        907.3                89                   13157   1629 (±542)       7.3%
Atlantis             12850        62687                852.9                29028   85641 (±17600)    449.9%
Bank Heist           14.2         190.8                67.4                 734.4   429.7 (±650)      57.7%
Battle Zone          2360         15820                16.2                 37800   26300 (±7725)     67.6%
Beam Rider           363.9        929.4                1743                 5775    6846 (±1619)      119.8%
Bowling              23.1         43.9                 36.4                 154.8   42.4 (±88)        14.7%
Boxing               0.1          44                   9.8                  4.3     71.8 (±8.4)       1707.9%
Breakout             1.7          5.2                  6.1                  31.8    401.2 (±26.9)     1327.2%
Centipede            2091         8803                 4647                 11963   8309 (±5237)      63.0%
Chopper Command      811          1582                 16.9                 9882    6687 (±2916)      64.8%
Crazy Climber        10781        23411                149.8                35411   114103 (±22797)   419.5%
Demon Attack         152.1        520.5                0                    3401    9711 (±2406)      294.2%
Double Dunk          -18.6        -13.1                -16                  -15.5   -18.1 (±2.6)      17.1%
Enduro               0            129.1                159.4                309.6   301.8 (±24.6)     97.5%
Fishing Derby        -91.7        -89.5                -85.1                5.5     -0.8 (±19.0)      93.5%
Freeway              0            19.1                 19.7                 29.6    30.3 (±0.7)       102.4%
Frostbite            65.2         216.9                180.9                4335    328.3 (±250.5)    6.2%
Gopher               257.6        1288                 2368                 2321    8520 (±3279)      400.4%
Gravitar             173          387.7                429                  2672    306.7 (±223.9)    5.3%
H.E.R.O.             1027         6459                 7295                 25763   19950 (±158)      76.5%
Ice Hockey           -11.2        -9.5                 -3.2                 0.9     -1.6 (±2.5)       79.3%
James Bond           29           202.8                354.1                406.7   576.7 (±175.5)    145.0%
Kangaroo             52           1622                 8.8                  3035    6740 (±2959)      224.2%
Krull                1598         3372                 3341                 2395    3805 (±1033)      277.0%
Kung-Fu Master       258.5        19544                29151                22736   23270 (±5955)     102.4%
Montezuma's Revenge  0            10.7                 259                  4367    0 (±0)            0.0%
Ms. Pacman           307.3        1692                 1227                 15693   2311 (±525)       13.0%
Name This Game       2292         2500                 2247                 4076    7257 (±547)       278.3%
Pong                 -20.7        -19                  -17.4                9.3     18.9 (±1.3)       132.0%
Private Eye          24.9         684.3                86                   69571   1788 (±5473)      2.5%
Q*Bert               163.9        613.5                960.3                13455   10596 (±3294)     78.5%
River Raid           1339         1904                 2650                 13513   8316 (±1049)      57.3%
Road Runner          11.5         67.7                 89.1                 7845    18257 (±4268)     232.9%
Robotank             2.2          28.7                 12.4                 11.9    51.6 (±4.7)       509.0%
Seaquest             68.4         664.8                675.5                20182   5286 (±1310)      25.9%
Space Invaders       148          250.1                267.9                1652    1976 (±893)       121.5%
Star Gunner          664          1070                 9.4                  10250   57997 (±3152)     598.1%
Tennis               -23.8        -0.1                 0                    -8.9    -2.5 (±1.9)       143.2%
Time Pilot           3568         3741                 24.9                 5925    5947 (±1600)      100.9%
Tutankham            11.4         114.3                98.2                 167.6   186.7 (±41.9)     112.2%
Up and Down          533.4        3533                 2449                 9082    8456 (±3162)      92.7%
Venture              0            66                   0.6                  1188    380.0 (±238.6)    32.0%
Video Pinball        16257        16871                19761                17298   42684 (±16287)    2539.4%
Wizard of Wor        563.5        1981                 36.9                 4757    3393 (±2019)      67.5%
Zaxxon               32.5         3365                 21.4                 9173    4977 (±1235)      54.1%


Best Linear Learner is the best result obtained by a linear function approximator on different types of hand designed features'*. Contingency (SARSA) agent figures are the results obtained in ref. 15. Note the
figures in the last column indicate the performance of DQN relative to the human games tester, expressed as a percentage, that is, 100 × (DQN score − random play score)/(human score − random play score).
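
The normalization in this note is simple to reproduce. The sketch below applies the stated formula to scores taken from the table; `normalized_score` is a hypothetical name.

```python
def normalized_score(dqn, human, random_play):
    """100 * (DQN - random play) / (human - random play), as a percentage."""
    return 100.0 * (dqn - random_play) / (human - random_play)
```

Applied to the Alien, Breakout and Pong rows, the formula gives the percentages listed in the last column after rounding to one decimal place.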
Extended Data Table 3 | The effects of replay and separating the target Q-network


Game             With replay, with target Q   With replay, without target Q   Without replay, with target Q   Without replay, without target Q
Breakout         316.8                        240.7                           10.2                            3.2
Enduro           1006.3                       831.4                           141.9                           29.1
River Raid       7446.6                       4102.8                          2867.7                          1453.0
Seaquest         2894.4                       822.6                           1003.0                          275.8
Space Invaders   1088.9                       826.3                           373.2                           302.0


DQN agents were trained for 10 million frames using standard hyperparameters for all possible combinations of turning replay on or off, using or not using a separate target Q-network, and three different learning 
rates. Each agent was evaluated every 250,000 training frames for 135,000 validation frames and the highest average episode score is reported. Note that these evaluation episodes were not truncated at 5 min, 
leading to higher scores on Enduro than the ones reported in Extended Data Table 2. Note also that the number of training frames was smaller (10 million frames) than in the main results presented in 
Extended Data Table 2 (50 million frames). 
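
The evaluation schedule in this note can be sketched as follows. This is a minimal illustration for a single agent, not the authors' code: the function names and the episode scores are invented, and whether the reported maximum is also taken across the three learning rates is not stated here, so the sketch does not assume it.

```python
from statistics import mean

TRAINING_FRAMES = 10_000_000  # training budget stated in the note
EVAL_INTERVAL = 250_000       # evaluation interval stated in the note


def num_evaluations(training_frames: int = TRAINING_FRAMES,
                    interval: int = EVAL_INTERVAL) -> int:
    """Number of evaluation checkpoints reached during training."""
    return training_frames // interval


def best_average_score(episode_scores_per_eval: list[list[float]]) -> float:
    """Highest mean episode score over all evaluation checkpoints."""
    return max(mean(scores) for scores in episode_scores_per_eval if scores)


# Hypothetical episode scores from three checkpoints (made-up numbers).
example = [[10.0, 14.0], [30.0, 26.0, 34.0], [22.0]]

print(num_evaluations())
print(best_average_score(example))
```

Because the reported figure is a maximum over checkpoints, it describes the best evaluation reached during training rather than the score at the end of training.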
Extended Data Table 4 | Comparison of DQN performance with linear function approximator 


Game DQN Linear 
Breakout 316.8 3.00 
Enduro 1006.3 62.0 
River Raid 7446.6 2346.9 
Seaquest 2894.4 656.9 
Space Invaders 1088.9 301.3 


The performance of the DQN agent is compared with the performance of a linear function approximator 
on the 5 validation games (that is, where a single linear layer was used instead of the convolutional 
network, in combination with replay and separate target network). Agents were trained for 10 million 
frames using standard hyperparameters, and three different learning rates. Each agent was evaluated 
every 250,000 training frames for 135,000 validation frames and the highest average episode score is 
reported. Note that these evaluation episodes were not truncated at 5 min, leading to higher scores on 
Enduro than the ones reported in Extended Data Table 2. Note also that the number of training frames 
was smaller (10 million frames) than in the main results presented in Extended Data Table 2 
(50 million frames). 
Evolution of the new vertebrate head by co-option 
of an ancient chordate skeletal tissue 


David Jandzik, Aaron T. Garnett, Tyler A. Square, Maria V. Cattell, Jr-Kai Yu & Daniel M. Medeiros 


A defining feature of vertebrates (craniates) is a pronounced head 
that is supported and protected by a robust cellular endoskeleton. 
In the first vertebrates, this skeleton probably consisted of collage- 
nous cellular cartilage, which forms the embryonic skeleton of all ver- 
tebrates and the adult skeleton of modern jawless and cartilaginous 
fish. In the head, most cellular cartilage is derived from a migratory 
cell population called the neural crest, which arises from the edges 
of the central nervous system. Because collagenous cellular cartilage 
and neural crest cells have not been described in invertebrates’, the 
appearance of cellular cartilage derived from neural crest cells is con- 
sidered a turning point in vertebrate evolution’. Here we show that 
a tissue with many of the defining features of vertebrate cellular 
cartilage transiently forms in the larvae of the invertebrate chordate 
Branchiostoma floridae (Florida amphioxus). We also present evi- 
dence that during evolution, a key regulator of vertebrate cartilage 
development, SoxE, gained new cis-regulatory sequences that sub- 
sequently directed its novel expression in neural crest cells. Together, 
these results suggest that the origin of the vertebrate head skeleton 
did not depend on the evolution of a new skeletal tissue, as is com- 
monly thought, but on the spread of this tissue throughout the head. 
We further propose that the evolution of cis-regulatory elements near 
an ancient regulator of cartilage differentiation was a major factor 
in the evolution of the vertebrate head skeleton. 

The histological properties of cellular cartilage in larval and adult 
vertebrates have been studied for over a century, and many specialized 
cartilage subtypes have been identified’. In contrast to the diversity of 


[Figure 1a labels: neural tube, notochord, oral cirri, gill slits]

Figure 1 | Development of the amphioxus oral skeleton. a, A metamorphic 
amphioxus larva. The oral region, shown in b-e, is boxed. b, The oral 
skeleton of a metamorphosed larva stained with alcian blue. c, d, Scanning 
electron micrographs of forming cirri in early metamorphic (c) and 
mid-metamorphic (d) larvae. Scale bar, 100 µm. e, A scanning electron micrograph 
of adult oral cirri. Scale bar, 100 µm. f-h, Alcian blue staining of chondrocytes 
in a larval amphioxus cirrus (f), a larval Petromyzon marinus gill bar (g) 
and a larval zebrafish gill bar (h). The arrows point to chondrocytes that are 
dividing perpendicular to the axis of growth and intercalating to form stacks of 
discoidal cells. i-k, Toluidine blue staining of a larval amphioxus cirrus (i), a 
larval lamprey gill bar (j) and a larval zebrafish gill bar (k). The nuclei are 
blue, and the acidic extracellular matrix of the cellular cartilage is purple. The 
asterisks indicate the paired nuclei of dividing chondrocytes. The arrows point 
to vacuoles in maturing chondrocytes. The arrowheads indicate the acidic 
extracellular matrix. Original magnification, ×1,000 (f, i), ×700 (g, j) and 
×500 (h, k). 

cartilage in adults, at embryonic stages the head skeletons of all living 
vertebrates consist of a single type of histologically distinct cellular cartilage. 
This embryonic cartilage consists of tightly packed polygonal or disc-shaped 
cells that secrete a thin, homogeneous extracellular matrix material 
composed of fibrillar collagen and chondroitin sulphate proteoglycans’. 
While classical and modern histological examinations have not identified 
a clear homologue of vertebrate cellular cartilage in invertebrates, 
a few invertebrates have cellular-cartilage-like endoskeletal elements. 
For example, among the protostomes, horseshoe crabs (Merostomata), 
cephalopod molluscs (Cephalopoda) and sabellid polychaete worms 
(Sabellidae) have cartilage-like tissues’, although their phylogenetic distribution 
suggests that these tissues evolved independently of vertebrate 
cartilage, and of each other. Among the deuterostomes, both hemichordates 
and cephalochordates (amphioxus) have stiff, acellular pharyngeal 
endoskeletons that incorporate fibrillar collagen’. In addition, amphioxus 
has an oral skeleton that supports its tentacles (cirri) and that 
forms during metamorphosis (Fig. 1a-e). Although this oral skeleton 
is ensheathed in a thick integument that does not contain fibrillar collagen, 
recent work has shown that it has a cellularized core*’. Unlike 
other deuterostomes, urochordates lack rigid endoskeletal elements’, 
despite their status as the sister group to vertebrates. 

A recurring theme in evolutionary developmental biology is how structures 
in distantly related taxa can arise via conserved developmental 
mechanisms, revealing unexpected homology. We decided to test whether, 
despite its unusual histology in adults, the oral skeleton of amphioxus 
develops via mechanisms that are homologous to those of vertebrate 
cellular cartilage. Because the amphioxus oral skeleton forms at the 
end of an extended planktonic larval phase that is difficult to obtain in 
the field, its development has never been described. Using new meth- 
ods for the continuous laboratory culture of the Florida amphioxus (B. 
floridae), together with a protocol for artificially inducing synchronized 
metamorphosis*, we were able to obtain live amphioxus larvae at vari- 
ous stages of oral skeleton development. We first stained these larvae 
with the classic histological stain for vertebrate cellular cartilage, alcian 
blue, and found strong specific reactivity with the nascent oral skeleton 
(Fig. 1b, f). Closer histological analysis of individual skeletal rods at this 
stage revealed discoidal cells with large vacuoles that were dividing per- 
pendicular to the axis of cirrus growth and were surrounded by an acidic 
extracellular matrix (Fig. 1f, i). This histology is highly similar to that 
of the gill bar cartilage in embryonic lampreys (Petromyzon marinus) 
and zebrafish (Danio rerio) (Fig. 1g, h, j, k). 

We next asked whether the development of the oral skeleton in amphi- 
oxus requires the same intercellular signalling pathways as the cellular 
cartilage in vertebrates. Fibroblast growth factor (FGF)-mediated signal- 
ling is a conserved essential regulator of cellular cartilage differentiation 
in both jawed and jawless vertebrates”’®. We thus exposed metamor- 
phosing amphioxus larvae to SU5402 and UO126 (refs 11, 12), inhibitors 
of FGF-mediated signalling that block cellular cartilage differentiation 
in vertebrates. Treatment with either inhibitor suppressed the formation 
of the oral skeleton in metamorphosing amphioxus larvae (Fig. 2a-c 
and Extended Data Fig. 1c-h). Importantly, these inhibitors did not 
generally inhibit development, as the treated larvae displayed other tem- 
porally appropriate signs of metamorphosis, including formation of 
Figure 2 | The amphioxus oral skeleton requires FGF-mediated signalling 
for formation and expresses orthologues of vertebrate cartilage markers. 
a, A phase contrast image of cirri in a control larva treated with T3 thyroid 
hormone, which induces metamorphosis in amphioxus. Ninety-six per cent 
of T3-treated control larvae (22 of 23, from two experiments) developed a 
normal oral skeleton (arrow) after 4-5 days, as previously reported’. In situ 
hybridization (d-g) was performed on sections at the level of the dashed line. 
b, c, Representative phase contrast images of larvae treated with T3 and 
50 µM SU5402 (b) or T3 and 10 µM UO126 (c). All SU5402-treated larvae 
(26 of 26, from two experiments) and 83% of UO126-treated larvae (10/12, 
from one experiment) lacked oral cirri (arrows) but displayed other signs of 
metamorphosis, including metapleural folds, ventral mouth migration and 
secondary gill bar formation. d-g, Expression of amphioxus fibrillar collagen 
(ColA) (d), SoxE (e), SoxD (f) and DUSP6/7/9 (g) mRNA (shown in blue), 
as determined by in situ hybridization, in the oral region of metamorphic 
amphioxus larvae. Expression in chondrocytes is indicated by arrows, and 
expression in the mesothelium is indicated by arrowheads. 
the secondary gill bars and metapleural folds and ventral migration of 
the mouth’. 

A defining feature of vertebrate cellular cartilage is the secretion of 
fibrillar collagen, the major protein component of its extracellular matrix. 
Although previous histological examinations of amphioxus failed to find 
evidence of collagen fibres in oral cirri®, amputation of adult tentacles 
leads to the transcription of fibrillar collagen and SoxE messenger RNA 
at the site of amputation’. We assayed amphioxus fibrillar collagen (ColA) 
expression during metamorphosis, when the oral skeleton is differentiating. 
We found intense expression of ColA in the chondrocytes of 
nascent oral cirri and in the mesothelium that lines the anterior coeloms 
(Fig. 2d and Extended Data Fig. 2b, g-i). 

In the cellular cartilage of jawed vertebrates, fibrillar collagen expression 
and chondrocyte differentiation are activated by the transcription factor 
SOX9, a SOXE family member and key regulator of chondrogenesis”. 
The binding of SOX9 to the promoter of the gene encoding fibrillar 
collagen, Col2a1, is facilitated by transcription factors of the SoxD 
subfamily", while the expression of Sox9 depends on FGF-mediated 
signalling through FGF receptors (FGFRs)’”. In lampreys, SoxE genes’® 
and FGF-mediated signalling through FGFRs"° are also required for 
the proper differentiation of neural crest cell (NCC)-derived chondro- 
cytes, while SoxD is expressed in prechondrocytes"’. These results sug- 
gest that SOXE, SOXD and FGF signalling are ancient core components 
of the vertebrate cartilage gene program. We thus assessed whether the 
developing amphioxus oral skeleton expresses SoxE, SoxD and FGF 
signalling pathway components. We detected SoxE, SoxD and FGFR 
mRNA in oral chondrocytes and the surrounding mesothelium (Fig. 2e, f 
and Extended Data Fig. 2c, d). We observed similar expression of Ets 
and DUSP6/7/9, homologues of two FGF-mediated signalling target 
genes expressed in chondrogenic NCCs in zebrafish, mice and Xenopus 
laevis'*” (Fig. 2g and Extended Data Fig. 2e, f, j, k). Taken together, our 
data show that the developing amphioxus oral skeleton displays the core, 
conserved histological, developmental and molecular features of ver- 
tebrate embryonic cellular cartilage. 

Most of the cellular cartilage in the vertebrate head is derived from 
NCCs. Amphioxus lacks NCCs and probably forms its oral skeleton from 
the mesothelium that lines the anterior coeloms”’. This implies that the 
genetic program for generating cellular cartilage was primitively deployed 
in mesendoderm and later recruited by NCCs. The repurposing of 
ancient genes and genetic programs is a recognized way in which novelty 
arises during evolution. However, it is unresolved whether such co- 
option events are typically driven by changes in cis-regulatory sequences, 
changes in the function of transcription factors or a combination of the 
two. In all vertebrates that have been examined, including lampreys, SOXE 
transcription in NCCs is activated by the transcription factor TFAP2 
and is maintained by auto-regulation between SOXE family members**™. 
These regulatory interactions are direct and involve the physical bind- 
ing of TFAP2 and SOXE paralogues to SOXE enhancers in NCCs”"”. 
Previous work has shown that, despite new roles in NCC development, 
neither TFAP2 nor SOXE has acquired new DNA-binding properties 
in vertebrates*°. This implies that the novel expression of SoxE in 
NCCs was driven by changes in SoxE cis-regulatory sequences rather 
than by changes in transcription factor function. To directly test this 
hypothesis, we performed an interspecific assay of SoxE cis regulation. 
To ensure that all of the relevant amphioxus cis-regulatory sequences 
were queried, we built a bacterial artificial chromosome (BAC) reporter 
construct that contains the entire amphioxus SoxE locus and several 
flanking genes, and we tested this reporter in zebrafish (Fig. 3b). We 
reasoned that if changes in SoxE cis regulation were involved in the re- 
cruitment of SoxE to NCCs, then the reporter should recapitulate the 
expression pattern of endogenous SoxE in amphioxus (Fig. 3a). Alter- 
natively, if SoxE cis regulation has been conserved across chordates, the 
reporter should be active in NCCs, as shown previously for amphioxus 
Hox enhancers”. We observed reporter activity transiently throughout 
the neural tube and tail-bud region in early neurulae but not in migrat- 
ing NCCs or chondrocytes at any developmental stage (Fig. 3c-f and 


26 FEBRUARY 2015 | VOL 518 | NATURE | 535 


©2015 Macmillan Publishers Limited. All rights reserved 


LETTER 


[Figure 3b schematic labels: 83.5 kb, 103 kb, pTARBAC2.1, Tol2 sites]


Figure 3 | A reporter construct incorporating the amphioxus SoxE locus 
recapitulates the amphioxus SoxE expression pattern in zebrafish embryos. 
a, Transient expression of amphioxus SoxE transcripts (purple) in scattered 
neural tube cells (double arrowhead), the posterior axial mesoderm including 
the tail bud (arrow) and the anterior mesendoderm (arrowhead) in a 20-h, 
12-somite neurula, as previously described”’. b, Schematic of the amphioxus 
SoxE reporter construct. The construct spans the SoxE locus and 186 kilobases 
(kb) of flanking sequence, including adjacent genes, and is roughly equivalent 
to 1.3 million base pairs of human genomic sequence. The gene encoding 
green fluorescent protein (GFP) was fused in frame with the first exon (ex1), 
and Tol2 recombination arms were added to facilitate genomic integration. 
ampR, ampicillin resistance. c, A zebrafish embryo injected at the one-cell stage 
with the amphioxus SoxE BAC reporter construct and probed for GFP mRNA 
at the 16-h, 15-somite stage. Mosaic transcription of GFP (blue) was 
consistently observed in the neural tube (double arrowhead) and tail-bud 
region (arrow). d, A zebrafish embryo injected at the one-cell stage with the 
amphioxus SoxE BAC reporter and probed for GFP mRNA at the 22-h, 
26+-somite stage. Like endogenous amphioxus SoxE, the SoxE BAC reporter 
expression was transient and was extinguished by the late neurula stages, 
when vertebrate SOXE genes mark migrating and post-migratory NCCs. 
e, A dorsal close-up view of the embryo in c at the level of the hindbrain. 
Activity of the amphioxus SoxE BAC reporter occurred throughout the neural 
tube but not in the surrounding mesenchyme, which included some early 
migrating NCCs. The image is a composite of three photographs of the same 
embryo taken at different focal planes. f, Double in situ hybridization for 
zebrafish sox10 (a SoxE co-orthologue) (red) and GFP (blue) transcripts in 
an 18-somite zebrafish injected with the SoxE BAC reporter construct. The 
image is a composite of three photographs of the same embryo taken at 
different focal planes. GFP expression was limited to the neural tube and did not 
overlap with sox10 expression in the otic placode (asterisks) and migrating 
NCCs (arrows). 

Extended Data Fig. 3). This pattern closely resembles endogenous amphioxus 
SoxE expression in the embryonic neural tube and tail-bud” 
(Fig. 3a), regions that do not co-express Tfap2 and SoxE. These results, 
and the lack of endogenous soxE in the lateral and ventral neural tube 
of zebrafish embryos, suggest that the amphioxus SoxE reporter and, 
by extension, endogenous amphioxus SoxE are not regulated by Tfap2 
or SoxE proteins. 

To verify this finding, we assessed whether the reporter functioned 
properly in zebrafish embryos with depleted tfap2 function. While 


TFAP2A and TFAP2C dual knockdown eliminates the expression of 
sox10 and sox9b (amphioxus SoxE co-orthologues) in zebrafish””®, the 
activity of the amphioxus SoxE reporter was similar in wildtype zebra- 
fish and in tfap2a and tfap2c dual knockdown morphants (Extended 
Data Fig. 3). In the context of previous work, these data suggest that 
amphioxus and vertebrate SoxE cis regulation diverged, in part, through 
the evolution of new non-coding cis-regulatory sequences of tfap2 and 
SoxE. Whether these sequence changes were accompanied by changes 
in the function of other transcription factors is unclear from our assay. 

Our results suggest that the nascent oropharyngeal skeleton of early 
chordates incorporated collagenous cellular cartilage that is strikingly 
similar to vertebrate cartilage. In amphioxus, this tissue supports the 
oral tentacles, which prevent the ingestion of large particles during filter 
feeding and burrowing. It is likely that similar structures were present 
in the invertebrate ancestor of the vertebrates, as oral tentacles are found 
in lamprey larvae, adult hagfish and the chordate fossil Haikouella 
lanceolata’. We posit that the evolution of the vertebrate head skeleton 
involved two major developmental changes: the spread of collagenous 
cellular cartilage from the oral region into the pharynx and the head, 
and the novel differentiation of cellular cartilage from NCCs (Fig. 4). 
Our data also suggest that the acquisition of chondrogenic potential by 
NCCs involved the evolution of new transcription-factor-binding sites 


Figure 4 | The evolution of the vertebrate head skeleton via co-option of an 
ancient cellular cartilage gene program. a, A hypothetical early chordate with 
an oral skeleton consisting of mesendoderm-derived cellular cartilage (pink 
polygonal cells) and a pharyngeal skeleton of acellular cartilage (pink rods). 
SoxE controls cellular cartilage differentiation in the oral region and has 

an unrelated function in a subset of central nervous system precursors 

(blue stellate cells). b, An early pre-vertebrate chordate with migratory 
non-skeletogenic proto-NCCs expressing SoxE (blue stellate cells in the 
oropharyngeal region). c, Exposure to intercellular signals in the oral region 
activates the cellular cartilage gene program in SoxE-expressing proto-NCCs 
(blue polygonal cells). d, Alternatively, mesendoderm-derived cellular cartilage 
(pink polygonal cells) spreads throughout the pharynx, before being replaced 
by NCC-derived cellular cartilage (e). e, Subsequent to c or d, invasion of 

the pharynx by chondrogenic NCC-derived cellular cartilage gives rise to the 
NCC-derived head skeleton of vertebrates (blue polygonal cells), similar to the 
model proposed by Rychel and Swalla’. 
at the SoxE locus. In addition to the oral cartilage, SoxE is expressed in 
the embryonic central nervous system of amphioxus. One of the first 
steps in the evolution of NCC-derived cellular cartilage may have been 
cis-regulatory mutations that maintained SoxE expression in non- 
skeletogenic proto-NCCs emerging from the central nervous system. 
Later, when these SoxE-expressing migratory cells mixed with mesen- 
doderm in the oral region, exposure to skeletogenic signals such as FGFs 
could have induced their differentiation into chondrocytes (Fig. 4). Thus, 
the dual roles of the ancestral chordate SoxE gene in neural and skeletal 
development may have predisposed evolving NCCs towards acquir- 
ing chondrogenic ability, potentiating the evolution of the vertebrate 
“new head”. 


Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper. 
METHODS 


Amphioxus adults were fed a diet of live unicellular algae and allowed to spawn 
spontaneously, en masse, in closed aquaria containing 60-150 animals. The larvae 
from mass spawning events were collected and raised on live unicellular algae and 
then induced to metamorphose by exposure to T3 thyroid hormone”. Experiments 
on vertebrate animals were carried out in accordance with the guidelines of the 
University of Colorado, Boulder, Institutional Animal Care and Use Committee. 

The larvae were fixed, embedded in paraffin, sectioned to 10 µm and in situ hybridized 
to riboprobes using previously reported methods". The Ets, ColA, SoxE and 
SoxD riboprobes were made as previously described””. The DUSP6/7/9 and DUSP1/ 
4/5 riboprobes were synthesized using cDNA clones GC035f15 and GC034j09, 
respectively, from the B. floridae Gene Collection Release 1 (ref. 32). The FGFR 
riboprobe was synthesized using cDNA clone CAXG1602 from the B. floridae cDNA 
library, CAXG (NCBI EST accession, FE584207)**. 

Alcian blue and toluidine blue staining of 4-µm sections embedded in JB-4 plastic 
resin (Polysciences) and scanning electron microscopy were performed according 
to standard protocols. SU5402 and UO126 (Tocris Bioscience) were dissolved in 
dimethyl sulphoxide before dilution in seawater for larval treatments. The control 
and experimental larvae were selected at random from the same mass spawnings. 
The scoring of treated and control larvae for oral cirrus development was not 
blinded. 

The BAC recombination-mediated genetic engineering was carried out as previously 
described™, with modifications as follows. A region surrounding the transcription 
start site of SoxE was amplified using the primers 
5'-GCATggcgcgccGAGGATAGGTACCTATCCGTC-3' and 
5'-GCATccatggGCACCAGCGTCCAGTCGTACC-3', digested with AscI and NcoI and ligated 
into pLD53.SC2. The resultant plasmid was used to insert GFP in frame with SoxE 
in a BAC from the CHORI-302 library (BACPAC Resources Center, accession 
CH302-98N3; https://bacpac.chori.org). Tol2 sites were then inserted into the 
pTARBAC2.1 backbone. To do this, a plasmid was constructed by inserting the 
kanamycin resistance gene and a 441-base-pair-targeting sequence from the 
pTARBAC2.1 backbone into the plasmid pDest Tol2pA using Gateway cloning’. The 
primers 5'-GCATactagtAGAGGTTTGTCCAGGAGTTC-3' and 
5'-GCATactagtGCTGGGCTTGCTGAAGGTAGG-3' were used to amplify a cassette from this 
plasmid containing kanamycin resistance and the pTARBAC2.1-targeting sequence 
flanked by Tol2 recognition sites. The resultant fragment was digested with SpeI 
and circularized using DNA ligase. Then, bacteria containing the SoxE BAC and the 
pSV1-RecA plasmid were transformed with this circular DNA. Bacteria containing 
the recombinant BAC were then selected. The resultant reporter was then coinjected 
with mRNA encoding Tol2 recombinase” into AB strain zebrafish embryos at the 
one-to-two cell stage. For some embryos, this injection mix included splice-blocking 
morpholino antisense oligonucleotides (MOs) against tfap2a and tfap2c (tfap2a+c MO)”. 
Injected embryos expressing high levels of GFP protein, as revealed by fluorescence 
microscopy, were then fixed and in situ hybridized to GFP and sox10 probes. 
Several embryos that were coinjected with tfap2a+c MO and the BAC reporter were 
allowed to develop past the 18-somite stage to verify the high penetrance (16/18) 
reported for the tfap2a+c morphant phenotype”. 
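
The lowercase runs in the four primers above appear to be the restriction sites added for cloning, which would match the enzymes named for each digestion. The sketch below checks this reading; it is an illustration, not part of the published protocol. The primer labels are hypothetical, and the recognition sequences are the standard ones for AscI, NcoI and SpeI.

```python
import re

# Standard recognition sequences of the three enzymes named in the Methods.
RECOGNITION_SITES = {"AscI": "GGCGCGCC", "NcoI": "CCATGG", "SpeI": "ACTAGT"}

# Primer sequences as printed (5'->3'), with line breaks removed.
# The dictionary keys are hypothetical labels, not names used by the authors.
PRIMERS = {
    "SoxE_fwd": "GCATggcgcgccGAGGATAGGTACCTATCCGTC",
    "SoxE_rev": "GCATccatggGCACCAGCGTCCAGTCGTACC",
    "Tol2_fwd": "GCATactagtAGAGGTTTGTCCAGGAGTTC",
    "Tol2_rev": "GCATactagtGCTGGGCTTGCTGAAGGTAGG",
}


def added_site(primer: str) -> str:
    """Return the enzyme whose site equals the lowercase run in the primer."""
    match = re.search(r"[a-z]+", primer)
    if match is None:
        raise ValueError("primer has no lowercase run")
    lowercase_run = match.group(0).upper()
    for enzyme, site in RECOGNITION_SITES.items():
        if site == lowercase_run:
            return enzyme
    raise ValueError(f"no known site matches {lowercase_run}")


sites = {name: added_site(seq) for name, seq in PRIMERS.items()}
for name, enzyme in sites.items():
    print(name, enzyme)
```

On this reading, the first primer pair carries AscI and NcoI sites for ligation into pLD53.SC2, and both primers of the second pair carry SpeI sites, consistent with the SpeI digestion and self-ligation of the Tol2 cassette.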
Extended Data Figure 1 | Amphioxus oral cartilage differentiation is 
initiated before cirrus outgrowth and requires FGF-mediated signalling. 
a, b, Phase contrast images of a metamorphic amphioxus larva. a, The 
differentiation of cartilage rods occurs first in the rim of the mouth. b, The 
cartilage rods later grow outwards into nascent cirri. c, Toluidine blue staining 
of a JB-4-embedded control larval section showing the oral cartilage rod (red 
rectangle; arrows in f) embedded in the rim of the mouth. d, e, Toluidine blue 
staining of two representative SU5402-treated larvae, sectioned at the same 
level as the larva in c. The differentiation of the oral cartilage bar is completely 
eliminated in d and is strongly reduced in e. f-h, High magnification views 
of the sections in c-e. 
Extended Data Figure 2 | Expression of SoxE, ColA and FGF signalling 
components in the oral region of metamorphic amphioxus larvae. 
a-l, In situ hybridization was performed on transverse sections in the planes 
shown in a. b, A high magnification view of cirri hybridized with a probe 
against ColA. ColA mRNA was detected in the central cartilage rod, which is a 
single stack of discoidal chondrocytes. c, A high magnification view of cirri 
hybridized with a probe against SoxE. SoxE mRNA was detected in the central 
cartilage rod. d, A high magnification view of cirri hybridized with a probe 
against FGFR. e, f, High magnification views of cirri hybridized with a probe 
against Ets. g-i, High magnification views of cirri hybridized with a probe 
against ColA. j, k, High magnification views of cirri hybridized with a 
probe against DUSP6/7/9. l, A high magnification view of cirri hybridized with a 
probe against DUSP1/4/5, an orthologue of mouse Dusp4, a gene that is 
expressed at high levels in mesoderm-derived chondrocytes”. In all panels, 
the arrows indicate expression in oral chondrocytes, and the arrowheads 
indicate expression in the associated oral mesothelial cells. 
[Extended Data Figure 3 panel labels: 15 somites, 18 somites; panel d is the table below]

Domain                                   15 somites   18 somites   4 days     18 somites, tfap2 depleted
                                         (n=50)       (n=48)       (n=66)     (n=53)
Neural tube/CNS                          34 (68%)     30 (63%)     1 (2%)     43 (81%)
Pharynx/head cartilage                   1 (2%)       1 (2%)       0 (0%)     0 (0%)
Anterior ventral mesoderm/heart region   3 (6%)       3 (6%)       6 (9%)     1 (2%)
Trunk mesoderm/muscle                    45 (90%)     41 (85%)     14 (21%)   30 (57%)
Tailbud region                           42 (84%)     38 (79%)     0 (0%)     31 (58%)


Extended Data Figure 3 | Activity of the amphioxus SoxE reporter construct 
in zebrafish during development. Injected zebrafish embryos displaying 
broad GFP fluorescence and normal morphology were processed for in situ 
hybridization to detect GFP mRNA. Embryos were scored for the expression of 
GFP in five or more cells in each domain. a, A map of the expression domains 
scored in 15-somite and 18-somite embryos. b, A representative 4-day larva 
showing sporadic expression in the heart region and trunk muscles (arrows in 
insets). The scored domains are outlined. The image is a composite of eight 
photographs of the same larva taken at different focal planes. c, A representative 
18-somite tfap2a and tfap2c dual knockdown morphant (tfap2a/c morphant)”* 
expressing the amphioxus SoxE reporter. The image is a composite of four 


photographs of the same embryo taken at different focal planes. The tfap2a/c 
morphants” displayed a highly penetrant and almost complete loss of NCC 
marker expression, including sox10 (a co-orthologue of amphioxus SoxE). Like 
wildtype embryos, tfap2a/c morphant embryos displayed mosaic expression 
of the amphioxus SoxE reporter in the neural tube (double arrowheads) and tail 
bud (arrow). The ability of the amphioxus SoxE reporter to function in 
tfap2-depleted embryos supports expression data”’ suggesting that amphioxus 
SoxE transcription is Tfap2 independent. d, The numbers and frequencies of 
embryos and larvae with GFP expression in the indicated domains. The 
numbers in each column are the pooled results of two to four separate 
experiments. 
Experimentally induced innovations lead to 
persistent culture via conformity in wild birds 


Lucy M. Aplin, Damien R. Farine, Julie Morand-Ferron, Andrew Cockburn, Alex Thornton & Ben C. Sheldon 


In human societies, cultural norms arise when behaviours are trans- 
mitted through social networks via high-fidelity social learning’. 
However, a paucity of experimental studies has meant that there is 
no comparable understanding of the process by which socially trans- 
mitted behaviours might spread and persist in animal populations”. 
Here we show experimental evidence of the establishment of foraging 
traditions in a wild bird population. We introduced alternative novel 
foraging techniques into replicated wild sub-populations of great tits 
(Parus major) and used automated tracking to map the diffusion, 
establishment and long-term persistence of the seeded innovations. 
Furthermore, we used social network analysis to examine the social 
factors that influenced diffusion dynamics. From only two trained 
birds in each sub-population, the information spread rapidly through 
social network ties, to reach an average of 75% of individuals, with a 
total of 414 knowledgeable individuals performing 57,909 solutions 
over all replicates. The sub-populations were heavily biased towards 
using the technique that was originally introduced, resulting in es- 
tablished local traditions that were stable over two generations, de- 
spite a high population turnover. Finally, we demonstrate a strong 
effect of social conformity, with individuals disproportionately adopt- 
ing the most frequent local variant when first acquiring an innova- 
tion, and continuing to favour social information over personal 
information. Cultural conformity is thought to be a key factor in the 
evolution of complex culture in humans*”. In providing the first 
experimental demonstration of conformity in a wild non-primate, 
and of cultural norms in foraging techniques in any wild animal, our 
results suggest a much broader taxonomic occurrence of such an ap- 
parently complex cultural behaviour. 

Social learning, in which animals learn from others, can enable novel
behaviours to spread between individuals, creating group-level behaviours,
including traditions and culture. Social transmission occurs
between interacting individuals; hence, group dynamics and population
structure will determine the spread and persistence of traditions.
Additionally, individuals may strategically use social learning to maximize
its adaptive value, with consequences for when, how and what traditions
are established. However, while the capacity for social learning
has been described in many phylogenetically diverse taxa and has been
detailed in comprehensive laboratory studies, we have little knowledge
of the social dynamics associated with such learning in natural
systems. Experimentally quantifying cultural transmission in wild populations
remains difficult, with limitations associated with isolating and
training individuals, tracking the spread of information across large
numbers of animals and eliminating alternative explanations such as
individual trial-and-error learning.

Early observational studies of tits provide one of the most widely cited
examples of animal innovation and culture, when British birds famously
began to pierce the foil caps of milk bottles to take the cream. More
generally, great tits (P. major) are known to be highly innovative, opportunistic
foragers and to use social information in a wide range of contexts.
This life history, coupled with their fission–fusion social structure,
makes them excellent models for a large-scale empirical investigation 
of the social processes associated with cultural transmission. Here we 
used a novel system that incorporates automated data collection and 
passive integrated transponder tags, together with recently developed 
methods for social network analysis, to investigate the spread, estab- 
lishment and persistence of experimentally seeded traditions in wild 
great tits. 

We first developed an automated puzzle box that is baited with live 
mealworms (Fig. 1a), and performed a cultural diffusion experiment
based on the two-action and control design but where treatment groups
were exposed to a demonstrator trained on one of two distinct but equiv- 
alent actions. Two resident males were caught from each of eight sub- 
populations and exposed to one of three training regimens in captivity. 
In the first condition (‘control’), for which there were three sub-populations 
(that is, three replicates), neither individual was given any training. In 
the second condition (‘option A’; two replicates), both individuals were 
trained to access food from the puzzle box by using their bill to move 
the blue side of the sliding door from left to right. Last, in the third con- 
dition (‘option B’; three replicates), the birds were trained to solve the 
puzzle box by moving the red side of the sliding door from right to left 
(Supplementary Video 1). After 4 days of training, all birds were re- 
leased back into the wild, and three puzzle boxes, with both options 
available, were installed 250 m apart in each sub-population (Extended 
Data Fig. 1). We then automatically monitored individual visits to,
and solutions (‘solves’) at, these puzzle boxes over short-term (20 days
of exposure over 4 weeks) and long-term (5 days of exposure, 9 months 
later) periods. 

In the five sub-populations that were seeded with trained demonstra- 
tors, knowledge of how to solve the novel puzzle spread rapidly over 
20 days of exposure (Fig. 1b). A mean of 75% of the members of each local 
population (68-83%, n = 37-96) solved the puzzle box at least once. 
The diffusion of this behaviour was clearly sigmoidal (sigmoidal versus 
linear fit, change in Akaike information criterion (ΔAIC) = 15.31–54.17),
except in one replicate (T5, ΔAIC = 0.13). By contrast, many fewer
individuals solved the puzzle box in control sub-populations (9-53%, 
n = 5-46; Fig. 1b), in which uptake initially relied on individual inno- 
vation. The latency to the first solve, excluding the demonstrator, was 
significantly longer in control areas than in treatment areas (Welch’s 
two-sample t-test, t(6) = −16.1, P < 0.01; Fig. 1b), and the total number
of solutions was significantly lower (t(¢) = 4.6, P = 0.02; Fig. 1c).
There was a striking difference between the replicates that were seeded 
with alternative solving techniques. In all treatment sub-populations, 
learning was heavily biased towards the technique that was originally 
demonstrated (t(g) = 9.7, P < 0.01; Fig. 1c), while no consistent side
bias was observed between the control sub-populations (t(4) = −0.03,
P = 0.97; Fig. 1c).

We derived the social network for each sub-population independently 
of the social learning experiment, with 10 days’ sampling at a grid of 

Figure 1 | Cultural diffusion experiment. a, A puzzle box in which birds can
slide the door open in two directions (from the left, option A; or the right, 
option B) to access a reward. The puzzle box records the identity, visit duration 
and solution choice, and it resets after each visit. b, Diffusion curves for the 
treatment sub-populations, with demonstrators (T1-5; n = 91, 130, 132, 50, 90, 
respectively), and the control sub-populations, without demonstrators (C1-3; 


sunflower-seed feeders that had been equipped to record visitation data 
(Extended Data Fig. 2a, b). Co-occurrences (see Methods) were detected 
using a Gaussian mixture model to isolate clusters of visits in the spatio-temporal
data streams, with repeated foraging associations between
individuals forming the basis of social networks (Extended Data Fig. 2b, c). 
The social networks for all replicates were significantly non-random, 
even at the most local scale (T1-5, P < 0.001), and network-based dif- 
fusion analysis was used to quantify the extent to which these social ties 
predicted the acquisition of behaviour. From pooled replicate data, a
network diffusion model that included social transmission was overwhelmingly
supported over asocial learning (ΔAIC = 1520.7); the learning
rate was estimated to increase by a factor of 12.0 per unit of association
with knowledgeable individuals (Extended Data Fig. 3). An effect of 
age and sex was also supported, with juveniles and males having a faster 
learning rate (Table 1). These results support a dominant effect of social 
learning on the emergence of this novel behaviour and also demonstrate 
that the diffusion of innovation was influenced by the fine-scale patterns 
of social interactions (Supplementary Video 3). 

In all of the experimental replicates, the alternative solution, which 
was equally difficult and equally rewarded, was performed by at least one 
individual within the first 6 days of exposure (median, day 4). However, 
in contrast to most previous studies, in which discovery of an alternative 
solution led to the progressive erosion of the use of the seeded variant 
behaviour, we observed a pronounced strengthening of traditions


Table 1 | Network-based diffusion analysis 


[Figure 1, panels b and c. b, Diffusion curves over the days of exposure (axis ticks at days 10, 15 and 20). c, Number of solutions (left y axis, 0 to 12,000) and proportion of option A (right y axis, 0 to 1.0) for each sub-population: C1, C2 and C3 (control); T1 and T2 (option A); T3, T4 and T5 (option B).]


n = 56, 87, 61, respectively). c, The total number of solutions using each option 
in each replicate (sub-population) (left y axis, shown as stacked bars). The 
average proportion (dots) of option A performed by individuals, with 95% CI 
(bars) is shown (right y axis). The total number of solvers was 5, 46 and 19 for 
C1, C2 and C3 (controls), respectively; 76 and 89 for T1 and T2 (option A), 
respectively; and 96, 37 and 69 for T3, T4 and T5 (option B), respectively. 


over the rest of the experiment. To analyse this change in behaviour
over time, we used a generalized estimating equation model where the
dependent variable was the proportion of solutions using the seeded
technique on each day of data collection and the explanatory variables
were individuals and replicates. From pooled replicate data, there was
strong evidence that the preference for the arbitrary tradition increased
over time (coefficient ± s.e.m. = 0.13 ± 0.02, P < 0.001), with an estimated
14% increase in bias per day (95% confidence interval (CI) =
8–18%; Fig. 2a). This finding is consistent with a conformist transmission
bias, with individuals preferentially adopting the more commonly
practised variant when solving the puzzle box. More conclusive
evidence for such positive frequency-dependent copying was observed
when only the first solution for each individual was considered, with birds
disproportionately likely to initially adopt the variant used by the majority
of their group (sigmoidal versus linear fit, ΔAIC = 38.34; Fig. 2b).
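
As a rough check on the arithmetic, and assuming the model coefficient is on a log (or logit) scale so that exponentiating it gives a multiplicative change per day (the link function is not stated in the text, so this is an assumption), the point estimate of 0.13 reproduces the quoted daily increase:

```latex
\[
e^{0.13} \approx 1.139
\quad\Longrightarrow\quad
\text{an increase of roughly } 14\% \text{ per day}
\]
```
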
Individuals thus preferentially learnt the most common option when 
first learning (Fig. 2b). Yet, remarkably, they also continued to prioritize 
social information over personal information, matching their behav- 
iour to the common variant even after experiencing an equally reward- 
ing alternative. We analysed the trajectories for those individuals that 
used both options (n = 78). The majority of these individuals (85%, 
n = 66) retained a preference for the seeded variant (for example, see 
Fig. 2c and Extended Data Fig. 4). Three birds had a strong preference 
for the uncommon variant, and eight birds switched from the alternative 


Network-based diffusion model outputs

Transmission model        ΔAIC (top model)   Σwi    Social transmission parameter (estimated)   95% CI
Social: multiplicative    0                  0.99   12.0                                        8.8–16.0
  T1                      -                  -      22.4                                        11.8–30.2
  T2                      -                  -      12.2                                        8.2–17.1
  T3                      -                  -      7.3                                         2.9–14.3
  T4                      -                  -      29.8                                        10.9–42.6
  T5                      -                  -      13.4                                        8.3–20.02
Social: additive          33.7               0.01   -                                           -
Asocial                   1520.7             0      (Constrained to 0)                          -

Individual-level effects

Variable                               ΔAIC (top model)   Σwi    Estimate   Effect size
Age (juvenile or adult)                0                  0.99   -0.18      0.70
Sex (F or M)                           0                  0.97   0.10       1.22
Natal origin (resident or immigrant)   3.9                0.13   0.07       1.16

Summed Akaike weights (Σwi) and ΔAIC for network-based diffusion models, with maximum-likelihood parameter estimates of social transmission for the five treatment replicates. Estimates and effect sizes are presented for the individual-level variables. The diffusion analyses used a continuous time of acquisition model with a constant baseline learning rate (λ0), allowing differing social transmission rates in each replicate. F, female; immigrant, dispersed into the study site; M, male; resident, locally born.

Figure 2 | Evidence of social conformity. a, The proportion of solutions using
the seeded technique increased significantly over time in each replicate. The 
points are the proportion of solutions using the seeded technique on each day; 
the lines show the generalized estimating equation model fit. b, Comparison 
of the frequency of option A in the sub-population with an individual's first 
learnt option (pooled replicate data from T1-5). The node size represents the 


variant to the common variant. However, none of the birds made the 
reciprocal switch, and only one individual had no significant preference.
A subset of birds that dispersed between the experimental replicates
(a total of n = 41, of which 24 were between the two years of the
experiment, see below) provided additional evidence. Of 27 birds that
moved between replicates with the same seeded tradition, 26 (96%) retained
their preference for the common variant. In contrast, of 14 individuals
that moved between replicates with different seeded traditions,
10 (71%) changed their behaviour to match the common variant in the
new location, 3 retained their initial preference and 1 showed no preference
(χ² = 21.6, P < 0.001).

Seeded arbitrary traditions thus formed and persisted in each sub- 
population (Fig. 2). To investigate the long-term stability of these tradi- 
tions, we re-installed the puzzle boxes in one replicate of each condition 
(T1, T3 and C1) over 5 days in the following winter. Substantial turn- 
over in the population had occurred owing to the high mortality rates 
typical of this species; on average, only 40% of each sub-population
had been present the previous year. No additional demonstrators were 
trained, and no individual had had contact with the puzzle box in the 
intervening months. In the control sub-population, all solves (n = 42) 
were performed by only three individuals, all of which had also solved 
the puzzle box the previous year. However, in the two experimental sub- 
populations, knowledge of how to solve the puzzle box emerged even 

number of individuals (n = 1–147). The black line shows the expected result
under unbiased copying; the central red line shows the model fit with 95%
CI (outer red lines). c, The solution trajectories for individuals in the T2
sub-population that used both possible options (n = 10). The lines are the 
running average of the proportion of option A for each individual over the last 
ten visits, with each colour representing a single individual. 


faster than it had the preceding year, both among prior solvers and 
birds that were inexperienced at the task: in T1, 29 individuals solved 
the puzzle box a total of 967 times, and in T3, 35 individuals solved the 
puzzle box a total of 2,329 times (Fig. 3b). The results suggest a strong 
initial effect of memory, followed by a rapid, oblique transmission facilitated
by the greater number of demonstrators than in the initial experiment:
on the first day of exposure, 60% (T1) and 82% (T3) of ‘solvers’
were birds that had solved the puzzle box in the initial experiment, outweighing
their representation in the general population (36% in T1 and
46% in T3). The sub-populations also retained their original technique, 
with the solutions being heavily biased towards the option that had been 
seeded in the original experiment (Fig. 3b). Intriguingly, among birds that 
were present in both years, the within-individual bias towards the seeded 
variant had increased (linear mixed model, t(g3) = 2.80, P < 0.01; Fig. 3c), 
resulting in arbitrary traditions that were retained and strengthened. 

In summary, we show that wild great tits use social learning to acquire 
novel behaviours and that foraging techniques introduced by few indi- 
viduals (here only two in each replicate) can spread rapidly to the majority 
of the population, forming stable arbitrary traditions. Both social network 
ties and individual characteristics determined the transmission of these 
foraging techniques. The introduced arbitrary traditions were stable
over both short-term and long-term periods, becoming increasingly 
entrenched over two generations. This stability appeared to be a result 

Figure 3 | Local traditions persist across years. a, Diffusion curves for the
initial exposure (T1 2013-I and T3 2013-I, where I is 20 days) and the second 
exposure (T1 2013-II and T3 2013-II, where II is 5 days, 9 months later). 
The cumulative uptake of the behaviour in the second exposure is much higher 
for prior solvers (T1 2013-II and T3 2013-II, n = 23 and 26, respectively) but 
is also higher for naive birds (T1 2013-II and T3 2013-II, n = 28 and 27, 
respectively). b, The number of solutions using option A or B. In T1, one circuit
board failed, so the data are derived from two of three devices. nb, naive
birds; ps, prior solvers. c, The proportion of option A or B used in the initial and
second exposure. The histograms show the data for the sub-populations;

the dots show the mean proportion of option A performed by individuals, 
and error bars show the 95% CI. 

of informational conformity, with individuals matching their behaviour
to the most common variant when first learning and then continuously
updating their personal information. Conformity has long been
considered a central component of human culture, but experimental
evidence for its occurrence in wild animals has been limited to
a study of food preferences in vervet monkeys. We provide the first
experimental demonstration, to our knowledge, of conformist transmission
and cultural norms in foraging techniques in a wild animal. Our
study argues against the previous view that such behaviour is restricted
to the primate lineage and calls for a reconsideration of the evolution
and ecology of cultural conformity.


Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper. 

METHODS

Study population and area. The study was conducted in a wintering population 
of tits in Wytham Woods, UK (51° 46’ N, 01° 20’ W; Extended Data Fig. 1). One 
thousand and eighteen nest boxes suitable for great tits are installed at this site, with 
the vast majority of great tits breeding in boxes. Individuals are trapped as nestlings 
and breeding adults at nest boxes, and are fitted with both a British Trust for Orni- 
thology metal leg ring and a plastic leg ring containing a uniquely identifiable passive 
integrated transponder (PIT) tag (IB Technology). There is a further mist-netting 
effort over autumn and winter to tag individuals that immigrate into the population,
and we estimate that over 90% of individuals had been PIT-tagged at the
time of the study. In this population, great tits form loose fission–fusion flocks of
unrelated individuals in autumn and winter. Flocks congregate at patchy food sources
and can be observed at bird feeders fitted with PIT-tag-detecting antennae.
The experiments were conducted in eight sub-population areas within Wytham 
Woods that had relatively little short-term between-area movement of individuals 
(Extended Data Fig. 1). The work was subject to review by the Department of Zool- 
ogy ethical committee, University of Oxford, and was carried out under Natural 
England licences 20123075 and 20131205. 

Puzzle-box design. The experimental apparatus consisted of an opaque plastic box 
with a perch positioned in front of a door that could be slid to either side with the 
bill, to gain access to a feeder concealed behind. Video observations suggested that 
all great tits used their bill to move the door. The left side of the door was coloured 
blue, and the right side was coloured red, with a raised front section on the door to 
allow an easier grip. The concealed feeder contained approximately 500 live meal- 
worms and was refilled up to twice daily. Mealworms are a highly preferred food 
for great tits (Extended Data Fig. 5), and as live mealworms were used, solvers typi- 
cally extracted one worm and then carried it away from the puzzle box to kill and 
eat it (as confirmed by video observations; Supplementary Videos 1 and 2). Each 
puzzle box was surrounded by a 1 × 1 m² cage with a 5 × 5 cm² mesh that allowed
unlimited access by small birds but prevented access by large non-target species such 
as corvids or squirrels. A freely accessible bird feeder filled with peanut granules was 
also provided in the cage, at approximately 1 m from the puzzle box. Peanut granules 
are a much less preferred food source (Extended Data Fig. 5). Each peanut feeder 
had two access points fitted with RFID antennae and data-logging hardware. This 
feeder was used to attract the original demonstrator to the location and to record 
the identity of individuals that did not contact the puzzle box. 

All puzzle boxes contained a printed circuit board and motor and were powered 
by a 12-V sealed battery. The perch also functioned as an RFID antenna that registered
the visit duration (the time to the nearest second) and the identity of the visiting
individual. A ‘solve’ was recorded if the door was opened during an individual visit 
to the device, with the side direction also noted. If a solve occurred without an 
accompanying identified individual, this was recorded as an ‘unidentified solve’. 
One second after the solving bird departed, the door reset itself back to the middle. 
If more individuals visited before this happened, then a ‘scrounge’ was recorded, as 
these individuals were assumed to have taken food from the open door (as con- 
firmed by video observations). The door reset immediately after two individuals 
were registered scrounging, preventing more than two possible scrounging events 
per solve (Supplementary Video 2). 

Experimental procedure. Two males were captured from each sub-population 
(11 adults and 5 juveniles) to act as demonstrators. They were captured either by 
removal from roosting boxes on Sunday night or by mist-netting at a sunflower-seed 
feeder on Monday morning. They were transferred to individual cages in indoor 
captive facilities, and over 4 days, each pair of birds was subjected to one of three 
training regimens using step-wise shaping: (i) given no training and left in the cage 
with ad libitum food (control); (ii) trained to solve the novel puzzle box by pushing 
the blue side of the door to the right (option A); or (iii) trained to solve the novel
puzzle box by pushing the red side of the door to the left (option B). With the exception
of ‘control’ areas, which were clustered in the south of the woodland to avoid
cross-contamination, sub-populations were randomly assigned to a training regi- 
men, with both demonstrators from a single sub-population trained on the same 
technique. During training, the demonstrators were initially exposed to an open puzzle 
box baited with mealworms, which was then gradually closed over the course of 
4 days until the subjects were reliably re-opening it. The other side of the door was 
fixed during training. On Friday morning, the birds were released back at the site 
of capture in each respective sub-population. Puzzle boxes for which both options 
were available and were equally rewarding were installed at three sites 250 m apart 
on the following Sunday night (Extended Data Fig. 1). These puzzle boxes were 
run over a 4-week period at each site, continuously operating from Monday to Friday 
and then removed on Saturday and Sunday, for a total of 20 days of data collection. 

Four replicates were conducted in the first year of data collection (December 2012 
to February 2013; C1, C2, T1 and T3). At the sites for three of these replicates (C1,
T1 and T3), the puzzle boxes were simultaneously re-installed at the same locations
for 5 days of further data collection in December 2013. No additional demonstrators 


were trained, and no individual had had contact with the puzzle box in the 9 months 
between the two data collection periods. This second exposure sought to test the 
long-term stability of social learning at the sub-population level. This study was 
run before the second year of data collection for the cultural diffusion experiment, 
to exclude the possibility that dispersing individuals in new replicates could be 
re-introducing the novel behaviour. An additional four replicates (C3, T2, T4 and 
T5) were then conducted from December 2013 to February 2014 in new sub- 
populations, using the same initial protocol. 

Data analysis. The local population size for each replicate was defined as compris- 
ing all individuals in a replicate that had been recorded at least once at one of the 
following: the puzzle box, the nearby peanut feeder or the nearest network-logging 
feeders (operated Saturday and Sunday, see below), during the experimental per- 
iod (that is, from the weekend following the release of the demonstrators to the 
weekend after day 20 of operation of the puzzle boxes). When three replicates were 
compared with the ‘persistence’ trial in the following year, the local population was 
defined as all individuals observed at the puzzle box or as all individuals nearby the 
peanut feeder so that the areas were comparable. 

To analyse the results of the initial experiment, we first compared control replicates
and treatment replicates, by using Welch’s two-sample t-tests and by fitting
linear and sigmoidal models to the data, with the best model ascertained by the
difference in the AIC values. If individuals were using social information when
learning about the puzzle box, then we expected that there would be a difference
between the areas seeded with a trained demonstrator (treatment) and those without
(control). The replicates were thus compared in terms of latency to first solve
(the number of seconds from the beginning of the experimental period, excluding
the demonstrator) and the total number of solutions. Second, we compared the total
number of solutions in the two different experimental treatments. In this case, if a
more complex form of social learning than local enhancement to the feeding site
was occurring, then we expected a consistent bias towards the seeded variant in the
different treatments.
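
The Welch comparison described above can be sketched in a few lines. This is a minimal illustration and not the authors' analysis code: the function name `welch_t` is hypothetical, and the inputs are the solver counts given in the Figure 1 legend, used here only as convenient numbers (the paper's tests were run on latency to first solve and on total solutions, not on solver counts).

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    na, nb = len(a), len(b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(a) / na
    se2_b = variance(b) / nb
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (se2_a ** 2 / (na - 1) + se2_b ** 2 / (nb - 1))
    return t, df


# Illustrative inputs only: number of solvers per replicate (Fig. 1 legend).
control_solvers = [5, 46, 19]            # C1, C2, C3
treatment_solvers = [76, 89, 96, 37, 69]  # T1 to T5

t_stat, dof = welch_t(control_solvers, treatment_solvers)
print(f"t = {t_stat:.2f}, df = {dof:.2f}")
```

A negative t here simply reflects that the control mean is the smaller of the two; a P value would then be read from the t distribution with the approximated degrees of freedom.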

To analyse the change in individual and population preferences for option A or
B over time, we used a generalized estimating equation (GEE) model where the
dependent variable was the proportion of solutions using the seeded technique on
each day of data collection and the explanatory variables were individuals and
replicates, weighted by the overall number of solutions per day. The seeded technique
(A or B) was initially also included as an explanatory variable but was not significant
(coefficient ± s.e.m. = 0.13 ± 0.22, P = 0.55). Three individual variables
were included in a GEE model: sex, age and natal origin. Sex was determined at
capture using plumage coloration; age was determined from breeding records or
plumage coloration; and individuals were classed as ‘immigrants’ if they had dispersed
into the study site and ‘locally born’ if they had been ringed as a nestling
in the study site. Only age was significant (coefficient ± s.e.m. = −0.92 ± 0.20,
P < 0.001) and was included in the final model (sex, coefficient ± s.e.m. = 0.38 ± 0.22,
P = 0.08; natal origin, coefficient ± s.e.m. = −0.38 ± 0.22, P = 0.08).

If population-level conformity was partly the result of a conformist transmission
bias, then at first acquisition we would expect a sigmoidal relationship between the
population-level frequency of the option and the probability of adoption, with adoption
of the majority option disproportionately more likely than its absolute frequency.
By contrast, copying the last individual observed, or random copying,
should yield a linear relationship, with the probability of adopting option A or
B being roughly equal to its proportion in the overall population. To investigate
this, we isolated the first observed solutions by all individuals in all experimental
replicates and compared the option choice to the proportion of all previous option
A solves observed in the individual’s group at that site. The group length was set
at 245 s, which was the average group length observed using Gaussian mixture
models on temporal patterns of flocking (see below) at network-logging sunflower-seed
feeders. Both linear and sigmoidal models were then fitted to the data, with the
best model ascertained by the difference in AIC values.
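
The linear-versus-sigmoidal comparison can be illustrated with a small least-squares sketch. Everything below is an assumption made for illustration: the data points are hypothetical (not taken from the study), the logistic form and the grid search stand in for whatever fitting routine the authors used, and AIC is computed from the residual sum of squares under Gaussian errors.

```python
import math


def aic_least_squares(rss, n, n_params):
    """AIC for a least-squares fit with Gaussian errors (up to a constant)."""
    return n * math.log(max(rss, 1e-12) / n) + 2 * n_params


def fit_linear(x, y):
    """Ordinary least squares for y = a + b*x; returns the residual sum of squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


def fit_sigmoid(x, y):
    """Grid search for y = 1 / (1 + exp(-steep * (x - mid))); returns the best RSS."""
    best = float("inf")
    for steep in (s / 2 for s in range(2, 61)):       # steepness 1.0 to 30.0
        for mid in (m / 100 for m in range(30, 71)):  # midpoint 0.30 to 0.70
            rss = sum((yi - 1 / (1 + math.exp(-steep * (xi - mid)))) ** 2
                      for xi, yi in zip(x, y))
            best = min(best, rss)
    return best


# Hypothetical data: local frequency of option A vs. probability that a
# bird's first solution uses option A. Not values from the paper.
freq_a = [0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95]
p_adopt_a = [0.02, 0.02, 0.09, 0.17, 0.40, 0.60, 0.83, 0.91, 0.98, 0.98]

n_points = len(freq_a)
rss_linear = fit_linear(freq_a, p_adopt_a)
rss_sigmoid = fit_sigmoid(freq_a, p_adopt_a)
delta_aic = (aic_least_squares(rss_linear, n_points, 2)
             - aic_least_squares(rss_sigmoid, n_points, 2))
print(f"delta AIC (linear minus sigmoidal) = {delta_aic:.1f}")
```

A large positive difference favours the sigmoidal curve, which is the pattern expected under conformist copying; a difference near zero would be consistent with unbiased copying.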

We further examined the subset of individuals that moved between sub-populations 
(n = 41). This subset included all individuals recorded in more than one experi- 
mental replicate, whether within the season (n = 17) or between seasons (n = 24). 
No individual was observed in more than two replicates, and this analysis did not 
include individuals in the ‘persistence’ trial. A preference for option A or B at each 
location was defined as more than 75% of all solves for either option A or B in that 
replicate. Finally, to analyse the change in within-individual bias towards option A 
or B between the initial experiment and the second-year persistence trials, we used 
a general linear model where the dependent variable was the number of solves using 
the seeded variant over the total number of solves for each individual observed 
in both years. The explanatory variables were treatment type and year, with indi- 
vidual identity as a random effect. 
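
The 75% rule for assigning a preference can be written out directly. This is a small sketch with a hypothetical function name and hypothetical counts; it treats "more than 75%" as a strict inequality, which is one reading of the text.

```python
def classify_preference(solves_a, solves_b, threshold=0.75):
    """Label a bird's preference at one location as 'A', 'B' or 'none'."""
    total = solves_a + solves_b
    if total == 0:
        return "none"  # no solves recorded at this location
    if solves_a / total > threshold:
        return "A"
    if solves_b / total > threshold:
        return "B"
    return "none"


# Hypothetical counts of solves at two locations for one dispersing bird.
before = classify_preference(solves_a=40, solves_b=5)
after = classify_preference(solves_a=3, solves_b=30)
print(before, after, "switched" if before != after else "retained")
```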

Network data collection and analysis. Sunflower-seed bird-feeding stations had been 
deployed at 65 locations around Wytham Woods on an approximately 250 × 250 m²
grid, as part of long-term research into social network structure in tits. Each
station has two access points, each fitted with RFID antennae and data-logging
hardware. The feeding stations automatically opened from dawn to dusk on Saturday
and Sunday, scanning for PIT tags every 1/16 s. This study used the data
from the eight nearest locations to each set of puzzle boxes, for ten dates within and
surrounding the cultural diffusion experiment (the standard logging protocol runs
from September to February in Wytham Woods).

Great tits were detected visiting feeding stations and were individually identified
by their PIT tags. We then applied a Gaussian mixture model to the spatio-temporal
data stream to detect distinct clusters of visits. This method locates high-density
periods of feeding activity, thereby isolating flocks of feeding birds without imposing
artificial assumptions about group boundaries. A gambit of the group approach
was used with a simple-ratio index to calculate social associations, where individual
association strengths (network edges) were scaled from 0 (never observed foraging
together in the same group) to 1 (always observed in the same group and never
observed apart). While a single co-occurrence may not be meaningful, our automated
data collection method resulted in thousands of repeated group sampling events,
allowing social ties between individuals to be built up from multiple observations
of co-occurrences over time and across spatial locations. The networks contained
123 (T1), 137 (T2), 154 (T3), 95 (T4) and 110 (T5) nodes; the average edge strengths
were 0.09 (T1), 0.05 (T2), 0.08 (T3), 0.07 (T4) and 0.07 (T5). To test whether the
networks contained significantly preferred and avoided relationships, we ran permutation
tests on the grouping data, controlling for group size and the number of
observations, restricting swaps within days and sites. We tested whether the
observed patterns of associations were non-random by comparing the coefficient
of variation in the observed network with the coefficient of variation in the randomized
networks. The social networks for all replicates were significantly non-random,
even at local scales (T1, P < 0.0001; T2, P = 0.0005; T3, P < 0.0001; T4,
P = 0.0002; and T5, P = 0.0002).
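
The simple-ratio index and the coefficient of variation used as the test statistic can be sketched as follows. This is an illustrative simplification under stated assumptions: the flocks are hypothetical sets of bird IDs, the index is computed as groups containing both birds divided by groups containing either, and the permutation step (swaps restricted within days and sites) is not implemented here.

```python
from itertools import combinations
from statistics import mean, pstdev


def simple_ratio_index(groups):
    """Pairwise association strengths from a list of flocks (sets of bird IDs)."""
    birds = sorted(set().union(*groups))
    seen = {b: sum(b in g for g in groups) for b in birds}
    sri = {}
    for a, b in combinations(birds, 2):
        together = sum(a in g and b in g for g in groups)
        # Number of groups that contained at least one bird of the pair.
        either = seen[a] + seen[b] - together
        sri[(a, b)] = together / either if either else 0.0
    return sri


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (0.0 if the mean is zero)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


# Hypothetical flocks detected at a feeder; the IDs are placeholders.
flocks = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"c"}]
edges = simple_ratio_index(flocks)
cv_observed = coefficient_of_variation(list(edges.values()))
print(edges, round(cv_observed, 3))
```

In a permutation test of this kind, the observed coefficient of variation would be compared with its distribution across many randomized versions of the grouping data; a higher observed value indicates preferred and avoided associations.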

Finally, we used network-based approaches to ask whether the behaviour was 
socially transmitted through foraging associations. Network-based diffusion analysis 
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(NBDA) is an approach that tests for social learning by assuming that if social 
transmission is occurring, then the spread of trait acquisition should follow the 
patterns of relationships between individuals, with the transmission rate being line- 
arly proportional to the association strength**’”**. We used NBDA R code v.1.2 
(ref. 38), with the time of each individual’s first solution (the number of seconds 
since the beginning of the experiment) entered into the continuous time of acquisi- 
tion analysis function. Individuals that solved the puzzle box but that did not appear 
in the social network (that is, that had not been recorded in the standardized weekend 
logging) were excluded from the analysis. The effects of three individual-level vari- 
ables were also incorporated into the analysis: sex, age and natal origin. All com- 
binations of NBDA provided in the NBDA R code v1.2 were run, with the social 
transmission rate allowed to vary for each replicate. An AIC model averaging approach 
was used to find the best-supported model”*. 
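
The core assumption of NBDA, that the rate of acquisition rises linearly with the summed association strength to informed individuals, can be illustrated as follows. This sketch assumes one common additive parameterization (baseline rate multiplied by 1 plus s times the summed associations with informed birds); it is not taken from the NBDA R code v1.2, and the function name, arguments and numbers are hypothetical.

```python
def acquisition_rates(assoc, informed, base_rate, s):
    """Instantaneous rate of first solving for each individual.

    assoc[i][j]: association strength between i and j (0 to 1).
    informed[j]: True if j has already solved the puzzle box.
    base_rate:   asocial (baseline) acquisition rate per unit time.
    s:           social transmission parameter (s = 0 means no social effect).
    """
    rates = []
    for i, row in enumerate(assoc):
        if informed[i]:
            rates.append(0.0)  # informed individuals cannot acquire again
            continue
        social = sum(a for j, a in enumerate(row) if j != i and informed[j])
        rates.append(base_rate * (1.0 + s * social))
    return rates


# Toy network: bird 0 is a demonstrator, bird 1 is strongly tied to it.
assoc = [[0.0, 0.5, 0.0],
         [0.5, 0.0, 0.1],
         [0.0, 0.1, 0.0]]
print(acquisition_rates(assoc, [True, False, False], base_rate=0.01, s=4.0))
```

Under this form, a naive bird with no informed associates learns at the baseline rate, so comparing models with s fixed at zero against models with s estimated is what distinguishes asocial from social acquisition.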
Extended Data Figure 1 | Wytham Woods, UK, (51° 46’ N, 01° 20’ W), showing the location of replicates and puzzle boxes. The total area of Wytham Woods is 385 ha; the location and size of the separate woodland areas within the woods are labelled on the map. The green points indicate the puzzle-box locations for the three control replicates, C1, C2 and C3: Broad Oak, Bean Wood and Singing Way, respectively. The blue points indicate the puzzle-box locations for the two option A replicates, T1 and T2: Common Piece and Brogden’s Belt, respectively. The red points indicate the puzzle-box locations for the three option B replicates, T3, T4 and T5: Great Wood, Marley Plantation and Pasticks, respectively. (d) indicates the locations where trained demonstrators were caught from and released to.
Extended Data Figure 2 | Social network data collection. a, Schematic of a feeding station (shut), with sunflower-seed feeder, RFID antennae and data-logging hardware. The cage is to restrict access to small passerines only. b, Map of the study area showing the placement of 65 feeding stations. The stations are approximately 250 m apart and open simultaneously from dawn to dusk on Saturday and Sunday over winter. c, Grouping events were inferred from the temporal data stream gained from the feeding stations, with individuals assigned to grouping events in a bipartite network. d, Repeated co-occurrences were used to create social networks.
Extended Data Figure 3 | Social networks showing the diffusion of innovation. The red nodes represent individuals that acquired the novel behaviour after 20 days of exposure. The black nodes represent naive individuals. The yellow nodes represent trained demonstrators. The networks are heavily thresholded to show only the links above the average edge strength for each replicate (T1-T5, 0.09, 0.05, 0.08, 0.07 and 0.07, respectively). a, Network for the T1 replicate (n = 123). b, Network for the T2 replicate (n = 137). c, Network for the T3 replicate (n = 154). d, Network for the T4 replicate (n = 95). e, Network for the T5 replicate (n = 110).


©2015 Macmillan Publishers Limited. All rights reserved 


Extended Data Figure 4 | Individual trajectories (option A or B) for each replicate. Only individuals that performed both options are included, and individuals that moved between replicates are excluded. The lines are running proportions of the seeded option for each individual over its last ten visits. a, T1 (option A), n = 30. b, T2 (option A), n = 10. c, T3 (option B), n = 19. d, T4 (option B), n = 15. e, T5 (option B), n = 4.
[Panels a-e: x axis, proportion of an individual's visits; y axis, proportion seeded option.]
Extended Data Figure 5 | Food preference trials. The birds were presented with a freely available mix of 40 mealworms, 40 peanut granules and 40 sunflower seeds for 1 h on 2 days over 1 week at 6 sites (3 sites for T2 and 3 sites for T4). The trials were conducted 2 weeks after the end of the main experiment, in March 2014. Food choice was identified from video camera footage, and the trial was halted when all of one prey item was taken. Only great tits were included, but the birds could not be individually identified. The birds clearly preferred the live mealworms to peanut granules or sunflower seeds.
Fundamental properties of unperturbed
haematopoiesis from stem cells in vivo


Katrin Busch, Kay Klapproth, Melania Barile, Michael Flossdorf, Tim Holland-Letz, Susan M. Schlenner, Michael Reth,
Thomas Höfer & Hans-Reimer Rodewald


Haematopoietic stem cells (HSCs) are widely studied by HSC trans- 
plantation into immune- and blood-cell-depleted recipients. Single 
HSCs can rebuild the system after transplantation’ °. Chromosomal 
marking’, viral integration’” and barcoding” ” of transplanted HSCs 
suggest that very low numbers of HSCs perpetuate a continuous 
stream of differentiating cells. However, the numbers of productive 
HSCs during normal haematopoiesis, and the flux of differentiating 
progeny remain unknown. Here we devise a mouse model allowing 
inducible genetic labelling of the most primitive Tie2+ HSCs in bone
marrow, and quantify label progression along haematopoietic devel- 
opment by limiting dilution analysis and data-driven modelling. 
During maintenance of the haematopoietic system, at least 30% or 
~5,000 HSCs are productive in the adult mouse after label induction. 
However, the time to approach equilibrium between labelled HSCs 
and their progeny is surprisingly long, a time scale that would exceed 
the mouse’s life. Indeed, we find that adult haematopoiesis is largely 
sustained by previously designated ‘short-term’ stem cells downstream 
of HSCs that nearly fully self-renew, and receive rare but polyclonal 
HSC input. By contrast, in fetal and early postnatal life, HSCs are 
rapidly used to establish the immune and blood system. In the adult 
mouse, 5-fluorouracil-induced leukopenia enhances the output of HSCs
and of downstream compartments, thus accelerating haematopoietic 
flux. Label tracing also identifies a strong lineage bias in adult mice, 
with several-hundred-fold larger myeloid than lymphoid output, which 
is only marginally accentuated with age. Finally, we show that trans- 
plantation imposes severe constraints on HSC engraftment, consist- 
ent with the previously observed oligoclonal HSC activity under these 
conditions. Thus, we uncover fundamental differences between the 
normal maintenance of the haematopoietic system, its regulation by 
challenge, and its re-establishment after transplantation. HSC fate 
mapping and its linked modelling provide a quantitative framework 
for studying in situ the regulation of haematopoiesis in health and 
disease. 

The paucity of HSCs has largely impeded direct measurements of their 
functions in situ. To determine fundamental properties (frequencies of 
active HSCs, fluxes between stem and progenitor compartments, res- 
idence time and expansion in compartments) of unperturbed steady- 
state haematopoiesis'*, we devised an experimental system for inducible 
genetic marking of HSCs in situ. As driver for Cre recombinase (Cre) we 
used the Tie2 (also known as Tek) locus, which is expressed in embryonic 
and adult HSCs'*"*. We generated a knock-in mutant expressing from 
the Tie2 locus a gene encoding codon-improved Cre (iCre) fused to 
two modified oestrogen receptor binding domains (designated MCM)'* 
(Extended Data Fig. 1a-c). We chose this weakly inducible and tightly
regulated system to prevent leakiness. The Tie2^MCM allele was crossed to
Rosa^YFP mice expressing the yellow fluorescent protein (YFP) reporter in
a Cre-dependent manner. In the absence of tamoxifen, we did not detect
YFP+ haematopoietic cells in bone marrow, thymus and spleen in
Tie2^MCM/+ Rosa^YFP mice (n = 30; data not shown). After tamoxifen treatment,
MCM becomes active and deletes the stop cassette of the YFP
marker gene, thus rendering Cre-expressing cells and their non-Cre-expressing
progeny YFP-positive (Extended Data Fig. 1d). Early after treatment
(20 days) the labelled cells were almost exclusively HSCs, defined
as lineage marker (Lin)−Kit+Sca-1+ (LSK) CD150+CD48− (refs 3, 17
and 18) (Fig. 1a). In a total of 112 adult Tie2^MCM/+ Rosa^YFP mice, a mean
of 1.0% of HSCs were labelled in situ after tamoxifen treatment (Extended
Data Fig. 1e). Transplantation of a single YFP-marked HSC into genetically
conditioned HSC recipients (Rag2^−/− γc^−/− Kit^W/Wv; γc is also
known as Il2rg) led to long-term donor HSC engraftment and multilineage
reconstitution in primary and secondary recipients (Extended
Data Fig. 1f), hence initially labelled cells were functional HSCs (Extended
Data Table 1). We ruled out the possibility that HSC numbers and
functions were compromised by loss of one Tie2 allele in Tie2^MCM/+
mice in a series of control experiments (Extended Data Fig. 2).

We estimated by limiting dilution analysis frequencies of HSCs con- 
tributing to overall haematopoiesis, and to lymphoid and myeloid line- 
ages (Fig. 1b and Extended Data Fig. 1g-j). At least one out of three HSCs 
contributed YFP+CD45+ progeny in the bone marrow between 6 and
34 weeks after labelling. This is the overall and cumulative frequency
regardless of the type of lineage produced, and it represents a lower limit
given that all mice with labelled HSCs also contained labelled progeny.
Considering 2.8 × 10^8 total nucleated bone marrow cells per mouse,
and an HSC frequency of 0.006%, a mouse has ~17,000 HSCs; 30%
active HSCs indicates that ~5,000 HSCs contributed to normal haematopoiesis
within the observation period. In transplantation experiments,
Figure 1 | Inducible HSC labelling in Tie2^MCM/+ Rosa^YFP mice and
frequency estimates on HSC output. a, Phenotype of labelled cells 20 days
after tamoxifen injection. b, Limiting dilution analysis of labelled HSCs and
lineage output. The fraction of negative mice for YFP-expressing Lin+ CD45+
cells (green), granulocytes (red), pro B cells (blue) or double-positive (DP)
thymocytes (black) was plotted on a logarithmic scale against the number of
YFP+ HSCs (Extended Data Fig. 1g-j). Arrows indicate lower detection limit
(no negative mice in these groups).
absolute numbers of contributing HSCs are 100-fold lower than our
estimate for steady state haematopoiesis. We also determined pathway 
frequencies from HSCs to granulocytes, and T and B cell progenitors 
(Fig. 1b and Extended Data Fig. 1g-j). The results revealed correlations 
between the numbers of labelled HSCs and labelled lineage output, with 
the (time-averaged) probability for finding labelled granulocytes being 
five- to tenfold higher than lymphocytes. 
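
The estimate of active HSC numbers quoted above follows from simple arithmetic, reproduced here as a short check. The sketch takes the bone marrow cellularity as 2.8 × 10^8 nucleated cells per mouse, the value that reproduces the quoted ~17,000 HSCs; the variable names are illustrative only.

```python
# Inputs quoted in the text (cellularity taken as 2.8e8; see note above).
total_bone_marrow_cells = 2.8e8      # nucleated bone marrow cells per mouse
hsc_frequency = 0.006 / 100          # 0.006% of bone marrow cells are HSCs
active_fraction = 0.30               # lower limit from limiting dilution

total_hscs = total_bone_marrow_cells * hsc_frequency
active_hscs = total_hscs * active_fraction

print(f"total HSCs per mouse:  {total_hscs:,.0f}")
print(f"active HSCs per mouse: {active_hscs:,.0f}")
```

Both results round to the figures given in the text (~17,000 and ~5,000), and because the 30% is a lower limit, so is the number of active HSCs.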

To address the fluxes from adult HSCs via stem and progenitor com- 
partments to peripheral lineages, we resolved the output from in-situ- 
labelled HSCs kinetically (Fig. 2a-f). Unexpectedly, up to 3 weeks after 
induction in adult mice, the label was exclusively retained in HSCs with 
no label found in LSK CD150−CD48− short-term (ST)-HSCs, and LSK
CD150−CD48+ multipotent progenitors (MPPs) (Fig. 2b). Within these
downstream stem and progenitor compartments, the first marked cells 
emerged from 4 weeks onwards (Fig. 2c). Beyond 16 weeks, labelled HSCs 
had contributed to all analysed progenitor and mature cell lineages, with 
labelled myeloid cells arising sooner than labelled lymphoid cells (Fig. 2e, f). 
Analysis of overall bone marrow cells further indicated that the label 
emanates from marked HSCs (Extended Data Fig. 3a). Very few myeloid 
progenitors (0.010% common myeloid progenitors (CMPs); 0.001% 
granulocyte-macrophage progenitors (GMPs); 0.006% megakaryocyte 
erythroid progenitors (MEPs)) in the bone marrow were also initially 
marked (Fig. 2b, asterisk), consistent with weak expression of Tie2 in 
myeloid progenitors (http://www.immgen.org; data not shown) leading 
to direct, HSC-independent, labelling. Because of the limited life span of 
CMPs, GMPs and their progeny (Extended Data Fig. 4), the presence of 
labelled cells beyond 6 weeks after tamoxifen treatment reflects only cells 
that have arisen de novo from labelled HSCs. 

Given the extraordinarily slow label progression out of the adult HSC 
compartment, we investigated how rapidly HSCs are used during development
(Fig. 2g-k and Extended Data Fig. 3b). Tie2^MCM/+ Rosa^YFP mice
were treated in utero in midgestation (embryonic day (E) 10.5) with tamoxifen.
While initially (E12.5) almost exclusively HSCs (but not erythromyeloid
progenitors; Extended Data Fig. 5) were marked in fetal liver,
the label progressed within days to progenitors in fetal liver and bone
marrow, and by 1 week after birth, equilibrium was nearly reached
between labelled HSCs and the entire peripheral system. Hence, HSC
use is very rapid (and possibly complete) during development but slow
during maintenance of the system (Fig. 2l).

We exploited the kinetic data for adult mice (Fig. 2b-f) to infer the 
fluxes between stem and progenitor compartments as well as residence 
time and expansion of the cells in the compartments (Fig. 3). In a given 
reference compartment (for example, ST-HSCs) the cells lost by onward 
differentiation (for example, towards MPPs) are replaced by influx from 
the upstream compartment (for example, HSCs), and by cell production 
in the reference compartment itself (Fig. 3a). The flux is the product of 
the rate of differentiation per cell and the total amount of cells under- 
going differentiation. The movement of label between compartments con- 
tains information on the rate (Fig. 3b), and the ratios of total cell num- 
bers were determined for stem and progenitor compartments (Extended 
Data Fig. 6a). These considerations form the basis of our model for quan- 
tifying the labelling data (Supplementary Methods and Supplementary 
Discussion). 
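
The reasoning in the paragraph above can be illustrated with a minimal simulation. It is a sketch under stated assumptions, not the model of the Supplementary Methods: compartment sizes are taken to be at steady state, the label is assumed neutral, and the compartments form a linear chain, so that the label frequency of each compartment relaxes towards that of its upstream compartment at a rate of 1/(residence time). The function names are hypothetical; the residence times used in the example (330 days for ST-HSCs, 70 days for MPPs) and the ~1% HSC label are the values reported in the text.

```python
import math


def simulate_label(f_hsc, residence_times, days, dt=0.1):
    """Euler integration of df_R/dt = (f_U - f_R) / t_R along a chain.

    f_hsc:           constant label frequency in HSCs (top compartment).
    residence_times: residence time (days) of each downstream compartment.
    Returns the label frequency in each downstream compartment at `days`.
    """
    freqs = [0.0] * len(residence_times)
    for _ in range(int(round(days / dt))):
        upstream = f_hsc
        updated = []
        for f, t_res in zip(freqs, residence_times):
            updated.append(f + dt * (upstream - f) / t_res)
            upstream = f  # next compartment is fed by the pre-update value
        freqs = updated
    return freqs


def analytic_label(f_upstream, t_res, t):
    """Closed form for one compartment fed by a constant upstream label."""
    return f_upstream * (1.0 - math.exp(-t / t_res))


# HSC label ~1%; ST-HSC and MPP label after 240 days.
st_hsc, mpp = simulate_label(0.01, [330.0, 70.0], days=240.0)
print(f"ST-HSC label: {st_hsc:.4%}, MPP label: {mpp:.4%}")
```

With a residence time of 330 days, the single-compartment closed form gives about 89% of the HSC label frequency after 730 days, in line with the statement that equilibrium is not reached within the lifetime of a mouse.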

The steady label frequency of around 1% is consistent with self-renewal 
of labelled HSCs (Fig. 3c; HSC panel). In compartments downstream 
from HSCs, labelled cells incrementally replaced non-labelled cells 
(Fig. 3c). The mathematical model fitted the label frequencies measured 
up to 240 days after induction, and correctly predicted label frequencies 
at later time points (Fig. 3c). 

We estimated the rates of cell differentiation and net proliferation 
in the stem and progenitor compartments (Fig. 3d, e, Extended Data 
Fig. 6b, c and Supplementary Methods). The rate of net proliferation 
equals the number of cells born per day minus cells lost by death during 
the same time period. On average, per day, 1 out of 110 HSCs differ- 
entiates into an ST-HSC, and 1 out of 22 ST-HSCs differentiates into an 
MPP (Fig. 3d, f). At the MPP stage, considered a lymphoid—myeloid 
Figure 2 | Label progression through the haematopoietic system during
adult maintenance and fetal development. a, HSC label induction and output
analysis in adult tamoxifen-treated Tie2^MCM/+ Rosa^YFP mice. b-f, Percentages
of YFP+ cells among the indicated haematopoietic cells in bone marrow,
spleen and thymus. Mice were analysed at 1-3 (b; n = 12), 4-6 (c; n = 10), 8-12
(d; n = 9), 16-20 (e; n = 12) and 26-34 (f; n = 10) weeks after label induction.
Dots represent individual mice, and bars indicate the mean. DN, double
negative thymocytes; Gr, granulocytes; Mac, macrophages. g-k, In utero
HSC label induction and analysis in fetal, newborn and 1-week-old
Tie2^MCM/+ Rosa^YFP mice. YFP labelling frequencies were determined at E12.5
(h; n = 9), E15.5 (i; n = 10), in newborns (j; n = 6) and 1 week after birth
(k; n = 8) for the indicated organs and cells. l, Kinetics of label progression from
HSCs to ST-HSCs in embryonic (red; n = 66) and adult (blue; n = 110) mice
(n per time points; see Supplementary Methods); arrows indicate tamoxifen
treatment. Data are mean and s.e.m.


bifurcation point, we estimate that per day 1 out of 46 MPPs generates 
a CLP, while 1 MPP generates 4 CMPs (Fig. 3d, f). Given that the cell 
numbers also increase from HSC to ST-HSC and MPP (Fig. 3f), the 
efflux of cells exceeded influx in all of these compartments. To maintain 
compartment size, this flux difference is balanced by net proliferation 
(efflux minus influx = net proliferation). The rates of net proliferation 
Figure 3 | Inference of stem and progenitor cell differentiation, and
proliferation from label progression. a, At steady state, the rate of cell loss in a
reference compartment due to cell differentiation and death is balanced by cell
influx from the upstream compartment, and by proliferation within the
reference compartment. This balance relates the total upstream and reference
compartment sizes ($n_U$ and $n_R$, respectively) to the rates of cell differentiation
($\alpha_U$ and $\alpha_R$) and to net proliferation (proliferation − death) ($\beta_R = \lambda_R - \delta_R$).
b, The label frequency in the reference compartment ($f_R$) equilibrates over time
with the label frequency in the upstream compartment ($f_U$). The time for
label equilibration $t_R = 1/(\alpha_R - \beta_R)$ (residence time) is determined by how
rapidly cells are lost from the reference compartment. c, HSC label over time
(blue dots; red dashed line, average), label progression (blue dots, with s.e.m.,


increased from HSC via ST-HSC to MPP compartments, in parallel with 
the differentiation rates (Fig. 3d, e). 

Together, the differentiation rate and net proliferation determine how 
close a compartment operates to self-renewal. To quantify the degree of 
self-renewal, we define the residence time in a compartment as the time 
period in which the compartment would decay to 37% (1/e) of its size, 
if all influx were switched off. The residence time is determined by the 
duration a cell and its progeny spend in the compartment before being 
lost by differentiation or cell death. The compartment residence times 
can be estimated from the labelling data (Supplementary Methods). 
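
The definition of residence time given above can be made concrete with a short calculation. The sketch assumes the relation stated in the Figure 3 legend, read here as residence time = 1/(differentiation rate − net proliferation rate), so that a compartment with its influx switched off decays exponentially at that net loss rate. The function names are hypothetical; the example values (differentiation of 1 in 22 ST-HSCs per day, residence time of 330 days) are those reported in the text.

```python
import math


def residence_time(alpha, beta):
    """Residence time 1/(alpha - beta) in days.

    alpha: differentiation rate per cell per day.
    beta:  net proliferation rate (proliferation minus death) per day.
    Returns infinity when net proliferation fully offsets the efflux.
    """
    net_loss = alpha - beta
    return math.inf if net_loss <= 0 else 1.0 / net_loss


def fraction_remaining(t, alpha, beta):
    """Fraction of the compartment left after t days with no influx."""
    return math.exp(-(alpha - beta) * t)


# ST-HSCs: alpha = 1/22 per day; a 330-day residence time implies beta.
alpha_st = 1.0 / 22.0
beta_st = alpha_st - 1.0 / 330.0
print(f"implied net proliferation: {beta_st:.4f} per day")
print(f"share of efflux covered by self-renewal: {beta_st / alpha_st:.1%}")
print(f"fraction left after 330 days: {fraction_remaining(330, alpha_st, beta_st):.2f}")
```

For HSCs, where net proliferation matches efflux, the same function returns an infinite residence time, which corresponds to complete self-renewal.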
as in Fig. 2l), and mathematical model fitting (red lines, best fit; grey shades,
95% confidence bands). Data measured at ~400 days were not used for fitting
(green points, n = 11 mice) but have been predicted correctly by the model.
d, e, Inferred average rates of cell differentiation (d) and net proliferation (e) for
stem and progenitor cell compartments (with 95% confidence intervals).
f, Dynamics of stem and progenitor cell compartments inferred from the
experimental data. Relative compartment sizes are symbolized by grey boxes,
and magnitude of fluxes by arrow width. g, The model based on data in young
mice (blue dots) also predicts data obtained in very old mice (green dots).
h, Inferred rates of myeloid and lymphoid differentiation from MPPs in
younger (7-238 days after label induction; CMPs only considered beyond
42 days) (n = 110) and older (332-802 days) (n = 41) mice.


For HSCs, given that the labelling frequencies are maintained over time 
(Fig. 3c) despite efflux (Fig. 3d), net proliferation ensures complete self- 
renewal, and the residence time is theoretically infinite. Linking our esti- 
mate of the HSC net proliferation rate (~1 out of 110 per day; Fig. 3e, f) 
to proliferation measurements would imply that in assays using 5-bromo- 
2'-deoxyuridine (BrdU) ~1% of HSCs were labelled per day if HSCs 
lived indefinitely (and ~2% if HSC lifetime was 100 days). This figure 
is in the order of magnitude of the reported ~4% BrdU labelling per day 
in the ‘HSC-1’ population (as a subset of LSK SLAM-defined HSCs),
suggesting that the Tie2+ HSCs we label in situ reside at the top of
Figure 4 | In situ HSC response to 5-FU challenge. a, Experimental outline.
b, Mean of absolute leukocyte numbers in peripheral blood on the indicated 
days after 5-FU (n = 12, red dots) or PBS (n = 9, black dots) injections. 

c, Labelling frequencies of the indicated populations relative to HSCs after 5-FU 
(n = 18, red bars) or PBS (n = 15, grey bars) (mean from days 12 and 18). 
*P = 0.031 (ST-HSCs), 0.003 (MPPs), 0.213 (CLPs), 0.002 (CMPs), 0.034 
(GMPs) and 0.013 (MEPs) (two-tailed t-test assuming non-equal variances; 
5-FU versus PBS). d, Simulated increase (coloured dots and lines) in YFP 
labelling frequencies considering participation of HSCs and/or ST-HSCs versus 
experimental data (black dots; ratios of 5-FU over PBS frequencies, taken 
from c). Error bars denote s.e.m. 


the HSC hierarchy. For ST-HSCs, net proliferation almost accounts for 
the efflux towards MPPs, and the compartment requires only minimal 
influx from HSCs. Hence, even ST-HSCs operate near self-renewal, their 
residence time is exceedingly long (330 days), and label progression does 
not reach equilibrium within the ~2-year lifetime of a mouse (Fig. 3g). 
Substantial self-renewal (residence time 70 days) was found even at the 
MPP stage. Proliferation of MPPs leads to an efflux of cells into common 
lymphoid progenitors (CLPs) and CMPs that is approximately 280 times
the influx from ST-HSCs (Fig. 3f), making the MPPs a key amplifier. We 
also analysed progenitor compartments downstream from CLPs (pro B) 
and CMPs (MEPs and GMPs) (Extended Data Fig. 6d). 

An imbalance between myeloid and lymphoid production has been 
viewed as an age-dependent HSC property”’. We used fate mapping to 
re-address this question independent of transplantation, and found a 
marked myeloid bias that, however, was only marginally accentuated 
with age through relative loss of lymphoid potential (Fig. 3h). Such bias 
could be caused by skewed differentiation from a common progenitor, 
or by preferential proliferation in the myeloid branch. Phenotypes and 
designations of tested progenitors at putative branch points are shown 
in Extended Data Fig. 7. In both scenarios, which are not mutually 
exclusive, production of CMPs was several-hundred-fold larger than
that of CLPs.

To examine the responsiveness of haematopoiesis to perturbation, we 
challenged HSC-labelled mice with a single injection of 5-fluorouracil (5-FU),
a cytotoxic agent causing a transient leukopenia in the blood (Fig. 4a, b).
After peripheral rebound, we observed higher stem and progenitor cell 
labelling frequencies, relative to HSC labelling, than in untreated mice 
(Fig. 4c). This accelerated label equilibration between HSCs and sub- 
sequent compartments after haematopoietic injury indicates feedback 
inhibition on HSC output under steady state. These data are fit by a
model in which the kinetics of net proliferation and differentiation are 
accelerated (at least) in both HSC and ST-HSC compartments, but not in 
either alone (Fig. 4d and Supplementary Discussion). 

Output from highly polyclonal HSCs in adult haematopoiesis in situ 
(Fig. 1b) is in contrast to oligoclonal HSC activity found in transplanta- 
tion experiments’”. To address this discrepancy, we followed the fate of 
in-situ-marked HSCs after conventional bone marrow transplantation. 
Adult Tie2^MCM/+ Rosa^YFP mice were treated with tamoxifen, and Lin−Kit+
bone marrow cells from each donor (CD45.2) were transplanted into 
Figure 5 | Fate of in-situ-labelled HSCs after bone marrow transplantation.
a, Experimental outline. b, Chimaerism of donor HSCs in recipient mice
(n = 32) (mean, horizontal line). c, Frequencies of YFP+ HSCs (black dots; level
indicated by red dashed lines) in individual donor mice, and of YFP+ donor
HSCs (white dots) in the corresponding recipient mice. Representative data
from a total of 11 donors and 32 recipients are shown. Grey shaded areas denote
95% confidence intervals of sampling errors. d, Ratio of engrafted HSC label
output over donor HSC input. Red dashed line (ratio = 1) indicates equal
output and input (n = 32 recipients).


1-4 lethally irradiated recipients (CD45.1) (Fig. 5a). For each donor, the 
HSC labelling frequency was recorded before transplantation (‘input’). 
Despite uniformly strong donor HSC engraftment after 16-18 weeks 
(average 90%; Fig. 5b), the percentages of YFP-marked cells among 
total donor HSC (‘output’) were highly variable compared to the input 
frequencies in individual recipients (Fig. 5c). Input and output were 
roughly equal (within sampling error) in only 6 out of 32 recipients, 
whereas in most recipients donor HSCs were either lost (14 out of 32) or 
overrepresented (12 out of 32) (Fig. 5d). We estimate that on average 1 
out of 33 donor HSCs engrafted (Extended Data Fig. 8a-c). In two extreme
cases, YFP+ donor HSCs represented only 0.3% or 0.6% of the
input, but 17% or 35%, respectively, of the output, suggesting much 
stronger proliferation than for HSCs under steady state (Extended 
Data Fig. 8d and Supplementary Discussion). In summary, the participa- 
tion of individual engrafted HSCs to the repopulation of the bone mar- 
row is highly heterogeneous. 

Inducible labelling of HSCs in normal mouse bone marrow showed 
that during development HSCs are rapidly used to establish the haema- 
topoietic system. Once this is accomplished, individual HSCs are only 
rarely active, but over time a large portion of HSCs contributes to adult 
haematopoiesis. Indeed, although the mean HSC labelling frequency 
was low, all mice with marked HSCs produced labelled progeny, indi- 
cating that a large fraction, or at least 30%, of all HSCs contributes to 
haematopoiesis in adult mice after label induction. By contrast, pre- 
vious work based on barcoding showed that few HSCs actively drive 
haematopoiesis after transplantation. We re-addressed HSC diversity 
in the wake of transplantation, avoiding potential pitfalls of cellular 
heterogeneity of mixing experiments. The observed HSC oligoclonality 
is hence a hallmark of post-transplantation but not normal unperturbed 
haematopoiesis. These findings indicate that experimental and possibly 
also clinical HSC transplantations are based on a much smaller stem 
cell foundation than physiological haematopoiesis. 

HSC proliferation has often been taken as a proxy for asymmetric cell 
division (for example, ref. 23) and, also indirectly, as measure of differ- 
entiation rates. Proliferation may, however, not only yield differentiating 
progeny but also compensate for cell loss, precluding proliferation as 
an unambiguous marker of differentiation rates. Here, we quantified 
haematopoietic flux based on label progression from HSCs. Our data 
support in principle an order of differentiation in situ from HSCs to
ST-HSCs and MPPs and onwards. However, in divergence from the 
idea of a continuous stream from the top of a haematopoietic pyramid, 
a very low flux emanated from HSCs. While ST-HSCs are relatively 
short-lived on transplantation, in situ this compartment is exceedingly 
long-lived because ST-HSC self-renewal is almost sufficient to make 
up for the cell loss by differentiation. We hence identified this com- 
partment as the primary source of haematopoietic maintenance in 
mice. This reservoir property of ST-HSCs readily explains the apparent 
HSC-independence of haematopoiesis noted in a recent report”. How- 
ever, to maintain the ST-HSC compartment in the long run (>1 year), 
it requires continuous input from HSCs; we estimate that per day 150
HSCs feed into this compartment (17,000 total HSCs × 1/110 differentiating
per day). Hence, true HSC deficiency may go unnoticed for
extended periods of time while functionally impaired ST-HSC and MPP 
compartments would cause rapid signs of acute bone marrow failure 
(Extended Data Fig. 9). 

Collectively, HSCs act in development as founding stem cells, and 
in adult mice as replenishing cells, ST-HSCs as long-term amplifying 
cells, and MPPs as intermediate-term amplifying cells. The described 
fate-mapping system may also visualize responses to haematopoietic 
challenges imposed by cancer, infections, cachexia or ageing. The accel- 
erated HSC output in response to haematopoietic injury by treatment 
with 5-FU underscores this outlook. 


Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper. 
METHODS


Generation of Tie2^MCM/+ knock-in mice. For inducible marking of HSC in situ we
used the Tie2 locus as driver for Cre recombinase (Cre). Because conventional Cre
was insufficient for inducible HSC labelling in our first Tie2 knock-in mutant, we
constructed a second one in which we inserted a gene encoding codon-improved
Cre (iCre) fused to two modified oestrogen receptor binding domains, referred to
as MCM, into the first exon of the Tie2 locus by homologous recombination in
embryonic stem cells. The targeting strategy is depicted schematically in Extended
Data Fig. 1a. The targeting construct consisted, from 5′ to 3′, of a short homologous
arm (from −1898 to −1 base pairs (bp)) upstream of the ATG start codon of the Tie2
gene (ENSMUSG00000006386), a splice donor site, an intron and a splice acceptor site
(intron), all taken from the rabbit β-hemoglobin (HBB2) gene, the coding sequence
of codon-improved Cre flanked by two mutated oestrogen receptor sites, a
poly-adenylation signal (pA) from the rabbit β-hemoglobin gene, a FRT-flanked
neomycin (Neo) resistance gene, a long homologous arm (nucleotides +4 to +4983
of the Tie2 gene, considering the adenine in the ATG start codon as position +1),
and finally the diphtheria toxin subunit A gene (DT-A) for selection against random
integration. This weakly inducible and tightly regulated system was chosen to prevent
leakiness. The complete nucleotide sequence of the final targeting vector can
be obtained from the authors on request. Gene targeting experiments were performed
in E14.1 embryonic stem cells. Correct homologous recombination of the
targeting vector resulted in replacement of the start codon (nucleotides +1 to +3) in
the first exon of the Tie2 locus with the MCM cassette. Correctly targeted clones were
first identified by PCR. Targeted embryonic stem cell clones were transiently transfected
with a plasmid (pCAGGS-FlpE-GFP) expressing Flp-recombinase (gift from
H. J. Fehling) to delete the Neo cassette. Site-specific integration and Neo deletion
were confirmed by Southern blotting. DNA was digested with BspHI. Blots were
hybridized with a radiolabelled 1.0-kb iCre-specific probe. All Tie2 gene sequences
upstream and downstream of exon 1 are preserved in the Tie2^MCM allele. A Neo-deficient
embryonic stem cell subclone (Tie2^MCM/+) was injected into C57BL/6
blastocysts, and chimaeric mice were backcrossed to C57BL/6 mice to transmit the
Tie2^MCM allele. Heterozygous Tie2^MCM/+ mice are fertile and show no apparent
abnormalities. Homozygous Tie2^MCM/MCM embryos die between E9.5 and E12.5,
as previously described for Tie2^−/− mice.

Induction of reporter gene expression by tamoxifen. Tie2^MCM mice were crossed
to Rosa^YFP (Gt(ROSA)26Sor^tm1(EYFP)Cos) mice. Tamoxifen (1 g; Sigma T5648) was
dissolved in 4 ml ethanol absolute and 36 ml peanut oil (Sigma P2144) at 55 °C.
Aliquots of tamoxifen (25 mg ml^−1) were stored at −20 °C and heated to 37 °C
shortly before usage. Mice were injected daily on 5 consecutive days with 1 mg 
tamoxifen intraperitoneally. For in utero tamoxifen treatments the day of the plug 
was regarded as day 0.5 of the pregnancy (E0.5). Pregnant mice were treated by oral 
gavage on E7.5 or E10.5 with a single dose of 2.5 mg tamoxifen and 1.75 mg pro- 
gesterone (Sigma P0130) to counteract late fetal abortions. Delivery of the pups was 
routinely assisted by caesarean section at E20.5, and mice were raised by foster 
mothers. 
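
The dosing arithmetic implied by this paragraph can be checked with a short sketch. It assumes that the ethanol and oil volumes simply add to 40 ml and that the oral gavage dose is drawn from the same 25 mg ml−1 stock, which the text does not state; the function names are illustrative only.

```python
def stock_concentration(mass_mg: float, ethanol_ml: float, oil_ml: float) -> float:
    """Return the stock concentration in mg/ml, assuming the two volumes add."""
    return mass_mg / (ethanol_ml + oil_ml)


def volume_for_dose_ul(dose_mg: float, conc_mg_per_ml: float) -> float:
    """Return the volume in microlitres that contains dose_mg of drug."""
    return dose_mg / conc_mg_per_ml * 1000.0


conc = stock_concentration(1000.0, 4.0, 36.0)  # 1 g in 4 ml ethanol + 36 ml oil
adult_ul = volume_for_dose_ul(1.0, conc)       # 1 mg intraperitoneal dose
gavage_ul = volume_for_dose_ul(2.5, conc)      # 2.5 mg oral dose (same stock assumed)
adult_total_mg = 1.0 * 5                       # 5 consecutive daily injections

print(f"stock: {conc:.1f} mg/ml")
print(f"adult injection: {adult_ul:.0f} ul; gavage: {gavage_ul:.0f} ul")
print(f"cumulative adult dose: {adult_total_mg:.0f} mg")
```

Under these assumptions the stock matches the stated 25 mg ml−1, and a 1 mg dose corresponds to roughly 40 µl per injection.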

Flow cytometry. Bone marrow cells were flushed from femurs, tibias, coxa and
humeri using PBS supplemented with 5% heat-inactivated FCS. Cells were
filtered through a 20-µm filter (Falcon). Spleens, thymi and fetal organs were directly
mashed in a 20-µm filter with a plunger of a syringe. Fc receptors were blocked
by incubating cells in 5% FCS with purified mouse IgG (500 µg ml−1, Jackson
ImmunoResearch Laboratories). All stainings were performed in 5% FCS on ice for 
30 min with optimal dilutions of commercially-prepared antibodies. Reagents used 
were CD3ε allophycocyanin (APC) (17A2), CD3ε eFluor780 (17A2), CD3ε phycoerythrin
(PE) (145-C11), CD4 PE-Cy7 (GK1.5), CD8 APC (53-6.7), CD11b PE
(M1/70), CD11b PE-Cy7 (M1/70), CD11b PerCP Cy5.5 (M1/70), CD16/32 PE-Cy7 
(93), CD16/32 PE-Cy5.5 (93), CD19 PerCP-Cy5.5 (1D3), CD34 eFluor660 (RAM34),
CD45 PE-Cy7 (30-F11), CD45.1 eFluor660 (A20), CD45.2 PE (104), CD48 APC 
(HM 48-1), CD48 PE (HM 48-1), CD117 eFluor780 (2B8), CD127 PE-Cy7 (A7R34), 
CD135 PE (A2F10), Sca-1 PE-Cy7 (D7), Sca-1 PerCP-Cy5.5 (D7), Tie2 biotin (bio) 
(TEK4), Tie2 PE (TEK4) (eBioscience), CD3ε bio (500A2), CD4 bio (GK1.5), CD4
PE (H129.19), CD8 bio (53-6.7), CD8 PE (53-6.7), CD11b APC (M1/70), CD19 APC 
(1D3), CD19 bio (1D3), CD19 PE (1D3), CD45 bio (30-F11), Gr-1 APC (RB6-8C5),
Gr-1 bio (RB6-8C5), Gr-1 PE (RB6-8C5), IgM bio (R6-60.2), IgM PE (R6-60.2), 
Streptavidin PE-Cy7, Sca-1 PE (E13 161.7), Ter119 APC (Ter119), Ter119 bio
(Ter119), Ter119 PE (Ter119) (BD Pharmingen), CD4 APC (RM4.5), CD19 QDot605 
(6D9), streptavidin APC, streptavidin QDot605, Sca-1 PE-Cy5.5 (D7) (Invitrogen/ 
Molecular Probes), CD11b bio (M1/70.15) (Caltag), CD135 bio (A2F10), CD150
PE-Cy7 (TC15-12F12.2), CD229 PE (Ly9AB3) (Biolegend). The lineage cocktail (Lin) 
was composed of CD3ε, CD4, CD8, CD11b, CD19, Gr-1 and Ter119. To enrich for
Lin-negative progenitor populations, bone marrow cells were stained with lineage
markers followed by depletion with Dynabeads (Life Technologies) according to the
manufacturer’s instructions. Dead cells were excluded by staining with Sytox Blue
(Invitrogen). Cells were analysed on a FACSFortessa, or sorted by FACSAria III (all
Becton & Dickinson), and data were analysed by BD FACSDiva software. Bone
marrow cell populations were defined as follows: HSCs (Lin− Kit+ Sca-1+ CD150+
CD48−), ST-HSCs (Lin− Kit+ Sca-1+ CD150− CD48−), ST-HSCs CD229−, ST-HSCs
CD229+, MPPs (Lin− Kit+ Sca-1+ CD150− CD48+), CLPs (Lin− IL7R+ Flk2+ Kitlo
Sca-1lo), pro B cells (CD19+ IgM−), B cells (CD19+ IgM+), CMPs (Lin− Kit+ Sca-1−
CD34+ CD16/32lo), GMPs (Lin− Kit+ Sca-1− CD34+ CD16/32hi) and MEPs (Lin−
Kit+ Sca-1− CD34− CD16/32−). Populations in the spleen were defined as follows:
T cells (CD3+ CD19− CD11b− Gr-1−), B cells (CD19+ CD3− CD11b− Gr-1−), granulocytes
(Gr-1+ CD11b+ CD3− CD19−) and macrophages (CD11b+ Gr-1− CD3−
CD19−). The populations in the thymus were defined as follows: double-negative
thymocytes (CD4− CD8−) (DN), double-positive thymocytes (CD4+ CD8+) (DP),
CD4+ thymocytes (CD4+ CD8−) (CD4) or CD8+ thymocytes (CD4− CD8+) (CD8).
Erythromyeloid progenitors (Lin− Kit+ CD45+/lo) in fetal liver were defined as
described previously*!”. 
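
As an illustration of how such phenotypic definitions translate into a gating scheme, the sketch below encodes a subset of the bone marrow populations as marker-call tables and matches a cell against them. The positive/negative/low/high calls reflect our reading of the definitions above and should be treated as assumptions wherever the marker superscripts are ambiguous; `classify` and the dictionary layout are hypothetical helpers, not part of the published analysis.

```python
# Marker calls per population: "+" positive, "-" negative, "lo"/"hi" low/high.
LSK = {"Lin": "-", "Kit": "+", "Sca-1": "+"}  # Lin- Sca-1+ Kit+ backbone
LK = {"Lin": "-", "Kit": "+", "Sca-1": "-"}   # Lin- Kit+ Sca-1- backbone

BONE_MARROW_GATES = {
    "HSC": {**LSK, "CD150": "+", "CD48": "-"},
    "ST-HSC": {**LSK, "CD150": "-", "CD48": "-"},
    "MPP": {**LSK, "CD150": "-", "CD48": "+"},
    "CMP": {**LK, "CD34": "+", "CD16/32": "lo"},
    "GMP": {**LK, "CD34": "+", "CD16/32": "hi"},
    "MEP": {**LK, "CD34": "-", "CD16/32": "-"},
}


def classify(cell: dict) -> list:
    """Return the names of all gates whose required marker calls match the cell."""
    return [
        name
        for name, gate in BONE_MARROW_GATES.items()
        if all(cell.get(marker) == call for marker, call in gate.items())
    ]


example = {"Lin": "-", "Kit": "+", "Sca-1": "+", "CD150": "+", "CD48": "-"}
print(classify(example))
```

Because every gate here differs from the others in at least one required marker, a fully annotated cell falls into at most one population; a cell lacking any required marker call matches none.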

Mice. Mice carrying constitutively active reporter alleles (panYFP or panRFP) were
generated by crossing RosaYFP or RosaRFP reporter mice to germline Cre-deleter
mice. Offspring constitutively expressing YFP or RFP in all tissues were backcrossed
towards the C57BL/6 background, and used in competitive transplantation
experiments (Extended Data Fig. 2h, i) and in adoptive transfers of myeloid and
lymphoid progenitors (Extended Data Fig. 4). Rag2−/− γc−/− KitW/Wv mice were
used for adoptive transfer experiments without previous irradiation (Extended Data
Figs 1f, 2h, i and 4). For transfer experiments with irradiation, congenic B6.SJL-
Ptprca Pep3b/BoyJ (H-2b; CD45.1+) mice were used as recipients (Fig. 5). All animal
procedures were approved by the Regierungspräsidium Karlsruhe, and performed in
accordance with the Institutional Guidelines. Both male and female mice were used 
at ages ranging from embryonic E12.5 to around 120 weeks (2 years). No mice were 
excluded from the analysis. No randomization and no blinding were used. 

PCR genotyping. Tissues were lysed in lysis buffer (DirectPCR-Lysis Reagent Tail,
Peqlab) according to the manufacturer’s instructions. Mice were genotyped by PCR
(2 min at 94°C; 35 cycles of 20 s at 94°C, 30 s at 51°C and 1 min at 72°C; 10 min
at 72°C) using a common 5′ oligonucleotide annealing upstream of the rabbit
β-hemoglobin gene (5′-CATCGCATACCATACATAGGTGGAGG-3′) and a 3′ oligonucleotide
annealing to the rabbit β-hemoglobin gene (5′-AATCAAGGGTCCCCAAACTCAC-3′),
yielding a 526-bp DNA fragment indicating the Tie2MCM
allele, and a 3′ oligonucleotide (5′-GAGGCAGCATCTGTCTACAAGAGATGG-3′),
yielding a 745-bp DNA fragment indicating the Tie2+ allele.
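
The read-out of this three-primer PCR can be summarised as a simple lookup from observed band sizes to genotype. The following is a minimal sketch: the two fragment sizes come from the text, whereas the 25-bp sizing tolerance, the function name and the returned labels are illustrative assumptions.

```python
MCM_BAND_BP = 526  # fragment indicating the targeted (MCM) allele
WT_BAND_BP = 745   # fragment indicating the wild-type allele


def call_genotype(band_sizes_bp, tolerance_bp=25):
    """Map the estimated band sizes from one gel lane to a Tie2 genotype call."""
    has_mcm = any(abs(b - MCM_BAND_BP) <= tolerance_bp for b in band_sizes_bp)
    has_wt = any(abs(b - WT_BAND_BP) <= tolerance_bp for b in band_sizes_bp)
    if has_mcm and has_wt:
        return "MCM/+"
    if has_wt:
        return "+/+"
    if has_mcm:
        return "MCM/MCM"
    return "no call"


for lane in ([530, 740], [745], [526], []):
    print(lane, "->", call_genotype(lane))
```

A lane showing only the 526-bp band would indicate a homozygous sample, which, given the embryonic lethality described above, is expected only in embryonic material.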

Single-cell transplantation. YFP+ HSCs (LSK CD150+ CD48−) were isolated from
tamoxifen-treated Tie2MCM/+ RosaYFP mice by electronic single-cell deposition into
individual wells of a U-bottom 96-well plate containing 100 µl sterile 5% FCS. Before
injection, 100 µl sterile PBS was added to each well, and single cells were injected
intravenously into individual Rag2−/− γc−/− KitW/Wv recipients. Peripheral blood
samples were collected 4–8 weeks after transfer from the submandibular vein into
EDTA-containing microtubes (Sarstedt) to screen for progeny of donor cells (detected
by YFP expression). Organs of recipient mice were analysed at least 16 weeks after
transplantation as shown in Extended Data Fig. 1f. For secondary transplantations,
YFP+ Lin− Kit+ bone marrow cells from the primary recipients were purified by cell
sorting and 3 × 10^4 cells were intravenously injected into individual Rag2−/− γc−/−
KitW/Wv recipients. At least 16 weeks after transfer, organs were analysed by flow
cytometry. 

Competitive transplantation. Equal cell numbers (500–1,000 of each) of HSCs
(LSK CD150+ CD48−) from Tie2MCM/+ mice (either with panRFP reporter allele
or without), and wild-type competitor HSCs (LSK CD150+ CD48−) from Tie2+/+
mice (panYFP) were co-injected intravenously into Rag2−/− γc−/− KitW/Wv recipient
mice. Peripheral blood samples were collected 4–8 weeks after transfer to screen
for progeny of donor cells (not shown), and after at least 16 weeks the organs indicated
in Extended Data Fig. 2 were analysed for the contributions of progeny from
Tie2MCM/+ or Tie2+/+ HSCs. To assess the life span of myeloid progenitors and
mature granulocytes, cell-sorter-purified CMPs and GMPs (mixed as one population)
from panRFP mice were intravenously injected together with CLPs from
panYFP mice into Rag2−/− γc−/− KitW/Wv recipient mice (5 × 10^4 CMPs plus GMPs
and 0.5 × 10^4 CLPs per mouse). Peripheral blood samples were collected 7, 14, 21
and 32 days after transplantation from the submandibular vein into EDTA-containing
microtubes (Sarstedt) to screen by flow cytometry for donor-derived progeny
cells.

Transplantation into lethally irradiated mice. Lin-negative Kit* cells from tamox- 
ifen-treated Tie2!@—”* Rosa*™” mice (CD45.2) were sorted and 0.5 X 10°-2.0 X 10° 
cells injected intravenously into lethally irradiated (1,100 cGy; split dose with 4 h
time gap between each dose; Cesium 137 GammaCell 40 Irradiator, Best Theratronics)
congenic B6.SJL-Ptprca Pep3b/BoyJ mice. Recipient mice were maintained on antibiotic
water (1.17 g l−1 neomycin sulphate) for 14 days. Bone marrow of recipient
mice was analysed at least 16 weeks after transplantation.

Proliferation assay. Tie2MCM/+ and Tie2+/+ littermate mice were injected intraperitoneally
with 1 mg EdU (Invitrogen) in PBS. After 24 h, the bone marrow was
collected, and cells were stained with the appropriate antibodies before sorting cell-surface
receptor Tie2-positive (LSK Tie2+ CD150+ CD48−) and cell-surface receptor
Tie2-negative (LSK Tie2− CD150+ CD48−) HSCs, each from Tie2MCM/+ and Tie2+/+
littermates. Cells were collected in individual tubes containing 50% FCS, fixed and
permeabilized, and the Click-iT reaction was performed using the Click-iT EdU Flow
Cytometry Assay Kit (Invitrogen) according to the manufacturer’s protocol. 

5-FU treatment. 5-FU (Sigma F6627) was dissolved in sterile PBS, and tamoxifen-treated
Tie2MCM/+ RosaYFP mice were intravenously injected with a single dose of
250 mg kg−1 or PBS. Mice were maintained on antibiotic (1.17 g neomycin sulphate
per litre of drinking water) for 14 days. Peripheral blood was collected 4, 7, 10
and 12 days after 5-FU from the submandibular vein into EDTA-containing microtubes
(Sarstedt). Absolute numbers of leukocytes per microlitre blood were determined
by flow cytometry using anti-CD45 PE-Cy7 (30-F11; eBioscience) antibody
and APC-conjugated CaliBRITE Beads (BD Biosciences) as standard. On day 12 or
18 after 5-FU treatment bone marrow cells were isolated and analysed by flow
cytometry. With a two-tailed t-test, effect size d = 1, α = 0.05 and power 0.8, group
sizes should be 17 per sample group. We chose n = 15 (control) and n = 18 (5-FU
treatment).
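
The stated group size can be approximated with a standard closed-form calculation. The sketch below uses the normal approximation for a two-sample, two-tailed t-test plus a common small-sample correction term; which software or formula was actually used for the published figure is not stated, so this is an assumption made for illustration only.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(d: float, alpha: float = 0.05, power: float = 0.8):
    """Approximate n per group for a two-sample, two-tailed t-test.

    Returns (normal approximation, approximation with small-sample correction).
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-tailed test
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    n_normal = 2 * (z_alpha + z_beta) ** 2 / d ** 2
    n_corrected = n_normal + z_alpha ** 2 / 4  # correction for using t, not z
    return ceil(n_normal), ceil(n_corrected)


print(sample_size_per_group(1.0))
```

For d = 1, α = 0.05 and power 0.8 the uncorrected approximation gives 16 per group and the corrected value gives 17, in line with the group size quoted in the text.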

Mathematical methods. Mathematical modelling and parameter inference from 
the experimental data are described in the Supplementary Information. The cor- 
responding Matlab codes are available on request. No statistical methods were 
used to predetermine sample size, except for 5-FU treatment.

Extended Data Figure 1 | Generation of the Tie2MCM allele, labelling of
HSCs in tamoxifen-treated Tie2MCM/+ RosaYFP mice, and groups for
limiting dilution analysis. a, The endogenous Tie2 locus, the gene targeting
vector and the targeted allele with (Tie2MCMNeo) and without (Tie2MCMΔNeo)
neomycin are depicted. Oligonucleotides and DNA probe for genotyping, and
restriction sites used for Southern blotting are indicated (not drawn to scale).
b, PCR verification of the targeted Tie2MCM allele in embryonic stem cells
before and after neomycin deletion. c, Southern blot analysis of Tie2MCMNeo
(−Flp) and Tie2MCMΔNeo (+Flp), and wild-type embryonic stem-cell clones.
d, Principle of inducible fate mapping. In the absence of tamoxifen MCM is
inactive (reporter off). Tamoxifen treatment activates the reporter (reporter on).
After tamoxifen treatment, labelled cells and their progeny remain marked
(reporter on). e, Summary of HSC labelling frequencies of tamoxifen-treated
Tie2MCM/+ RosaYFP mice (n = 112; 5 times on 5 consecutive days) analysed
between 1 and 34 weeks after label induction. These data are the basis for the
kinetic analysis (Fig. 2a–f) and for the mathematical modelling (Fig. 3). Each
dot represents an individual mouse. Bar indicates mean (1.041 ± 0.8013 s.d.).
f, A single YFP+ LSK CD150+ CD48− HSC from a tamoxifen-treated
Tie2MCM/+ RosaYFP mouse was transplanted into a Rag2−/− γc−/− KitW/Wv
recipient mouse (1° transfer). Donor cells were identified by YFP expression,
and analysed 16 weeks after transplantation in bone marrow (BM), thymus and
spleen using the markers shown. YFP+ Kit+ donor bone marrow cells were
re-transplanted into a secondary Rag2−/− γc−/− KitW/Wv recipient (2° transfer),
and analysed as described for the primary transfer. g–j, HSC labelling
frequencies in tamoxifen-treated Tie2MCM/+ RosaYFP mice analysed 6 weeks
onwards after label induction were used for limiting dilution analysis of CD45+
output, granulocytes (n = 60) (g, h), pro B cells and double-positive thymocytes
(n = 79) (i, j). Each dot represents an individual mouse. Mice grouped together
are highlighted in black or white (groups I–IV). Mathematical calculations are
shown in the tables (h, j). In g, data shown represent the aggregate of labelling
frequencies below 1% shown in e, plus data obtained in mice receiving only a
single tamoxifen injection.

Extended Data Figure 2 | Characterization of Tie2MCM/+ mice.
a–d, Numbers of haematopoietic cell subsets isolated from bone marrow
(a, b), thymus (c) and spleen (d) of Tie2MCM/+ (n = 10; black) and Tie2+/+
(n = 10; white) littermates were determined by flow cytometry. Each dot
represents an individual mouse. Bars indicate mean. e, f, Flow cytometric
analysis of a representative Tie2MCM/+ (top) and Tie2+/+ (bottom) mouse
gated on Lin− cells, and analysed for expression of Kit versus Sca-1. Further
analysis of Lin− Kit+ Sca-1+ (LSK) cells for CD150 and CD48 revealed
comparable marker distributions (f). Percentages of LSK cells among the Lin−
fraction (left), and of HSCs, ST-HSCs and MPPs among the LSK fraction
(right) in the bone marrow of three independent Tie2MCM/+ (black) and
Tie2+/+ (white) mice. Data are mean ± s.d. g, Proliferation rates in surface
receptor Tie2-positive and Tie2-negative HSCs in the bone marrow of
Tie2MCM/+ (black) and Tie2+/+ (white) mice 24 h after EdU administration.
Data represent mean ± s.d. from two independent experiments of FACS-sorted
populations from Tie2MCM/+ (nExp1 = 5; nExp2 = 3) and Tie2+/+ (nExp1 = 3;
nExp2 = 2) mice. h, i, Rag2−/− γc−/− KitW/Wv recipients (n = 30; for analysis of
B and T cells in the spleen n = 28) were injected with equivalent numbers of
Tie2MCM/+ and Tie2+/+ HSCs (500–1,000 of each), and analysed after at least
16 weeks. The percentages of Tie2MCM/+ HSC-derived haematopoietic cells in
bone marrow, spleen and thymus are shown. Each dot represents an individual
mouse. Bars indicate mean.

Extended Data Figure 3 | Kinetics of YFP label emergence after label
induction in total bone marrow cells in adult mice, and in fetal liver and
bone marrow cells in fetal and early postnatal mice. a, Percentages of YFP+
cells among total non-lineage-depleted bone marrow cells of tamoxifen-treated
Tie2MCM/+ RosaYFP mice (n = 47). Time point 0 corresponds to the time of
tamoxifen treatment of adult mice (all of which were at least 6 weeks old).
b, Tie2MCM/+ RosaYFP mice (n = 32) were treated with tamoxifen on E10.5
(time point 0). Subsequently, percentages of YFP+ cells were determined
among total fetal liver on E12.5 and E15.5, and in bone marrow at birth and
1 week of age. Each dot represents an individual mouse.

6 weeks after adoptive transfer. a, Rag2−/− γc−/− KitW/Wv recipient mice
(n = 8) received CMPs and GMPs (together 5 × 10^4 per mouse) from panRFP
mice together with CLPs (0.5 × 10^4 per mouse) from panYFP mice. Peripheral …
b, Analysis of donor chimaerism of CMP/GMP-derived (Gr-1+ CD11b+
granulocytes, red) and CLP-derived (CD19+ B cells, yellow) progeny. Each line
represents an individual mouse.

Extended Data Figure 5 | Labelling of HSCs, but not erythromyeloid
progenitors, in Tie2MCM/+ RosaYFP embryos treated on E10.5 in utero by
tamoxifen. a, Tie2MCM/+ RosaYFP embryos were treated with tamoxifen on
E7.5 or E10.5, and analysed at E12.5. b–e, Fetal liver cells from representative
Tie2MCM/+ RosaYFP embryos labelled on E10.5 (b) or E7.5 (d) were analysed by
flow cytometry for YFP labelling in HSCs (LSK CD150+ CD48−) and
erythromyeloid progenitors (EMP) (Lin− Kit+ CD45+/lo) (see Methods for
references). HSCs were marked in mice labelled at E7.5 and E10.5, but
erythromyeloid progenitors were not marked in mice labelled on E10.5.
c, e, Summary of experiments depicted in b (n = 8) (c) and d (n = 6) (e). Each
dot represents an individual embryo. Bars indicate mean.

Extended Data Figure 6 | Details of model analysis. a, Ratio of compartment
sizes for stem and progenitor cell compartments. Compartment sizes were
determined by cell counting of cell suspensions from bone marrow, spleen and
thymus in a Neubauer chamber combined with phenotypic definition of each
population by flow cytometry. Mean and s.e.m. for n = 10 mice are shown.
b, Confidence bounds for the model parameters. Profile likelihoods (blue lines)
for the inverse residence times, k = 1/τ, of cells in the ST-HSC, MPP, CMP and
CLP compartments (units: per day). The profile likelihoods have been
computed as described in Supplementary Methods, with the experimental data
from Fig. 3c. The red lines indicate the 95% confidence level. Except for the
CMPs, for which the data are consistent with a broad range of k (that is, the
CMP residence time is low), the residence times are accurately determined by
the label propagation data alone. In particular, the very low k for the ST-HSCs
shows that this compartment operates near self-renewal. c, Profile likelihoods
(blue lines) and 95% confidence level (red lines) for the net proliferation rates
(β) and differentiation rates (α), computed with the data of Fig. 3c and
Extended Data Fig. 4 (unit: per day). The differentiation rate from HSCs
(α_HSC→ST-HSC) has the same profile likelihood as the net proliferation of HSCs
(β_HSC) and is therefore not shown. α_HSC→ST-HSC, α_ST-HSC→MPP, α_MPP→CLP,
β_HSC and β_ST-HSC are accurately determined by the data. Moreover, both
α_MPP→CMP and β_MPP have a lower bound that is one and two orders of
magnitude larger than the respective parameters for ST-HSCs and HSCs,
showing that MPPs have significantly higher proliferation and differentiation
activities than the preceding compartments. Note that β_MPP and α_MPP→CMP
are strongly correlated (not shown). d, Label progression data and fit of the
mathematical model for further downstream myeloid precursors (CMPs to
GMPs or MEPs) and lymphoid precursors (CLPs to pro B cells) in the bone
marrow. Data measured up to 238 days after label induction (blue points,
with s.e.m., as in Fig. 21) have been used for the model fit (red lines, best fit;
grey shades, 95% confidence bands). The parameter values are as in
Supplementary Table 1 and, in addition, α_CMP→GMP = 2 (0.04, 4) day−1,
α_CMP→MEP = 3 (0.1, 4) day−1, α_CLP→pro B = 2 (0.8, 4) day−1, β_CMP = 4 (−1, 4)
day−1, β_CLP = 3 (0.4, 4) day−1, τ_GMP = 0.12 (0.12, 33) days, τ_MEP = 0.13 (0.13,
22) days, τ_pro B = 54 (6, 141) days (in brackets: 95% confidence bounds).
For further details see Supplementary Information.

Extended Data Figure 7 | Consequences of lymphoid–myeloid branch
points at distinct stem or progenitor stages. a–c, Lymphoid–myeloid branch
points were considered at the MPP stage (a) or at the earlier CD229+ ST-HSC
subset (b). Each population is phenotypically defined as shown in c. MPPs
have also been termed HPC-1, and the CD229+ ST-HSC subset has also been
termed MPP2/3 (ref. 18). Cell flux rates per day are shown from the branch
points to CMPs or CLPs (right panels in a and b, with 95% confidence
intervals). In both scenarios, the production of CMPs is several-hundred-fold
larger than that of CLPs. Assuming the branch point at the CD229+ ST-HSC
stage, biased myeloid differentiation is still evident (~5-fold), but the large
uneven production is mainly achieved by flux amplification downstream from
the bifurcation. d, Label progression data and fit of the mathematical model.
Data measured up to 311 days after label induction in the ST-HSC CD229− and
ST-HSC CD229+ compartments (orange points, with s.e.m.; groups of mice:
57 days n = 7; 122 days n = 4; 311 days n = 8) and, in the CLP, MPP and CMP
compartments, data measured up to 238 days (orange points, with s.e.m.;
groups of adult mice as in Fig. 21) have been used for the model fit (red lines,
best fit; grey shades, 95% confidence bands). The resulting parameters
are α_HSC→CD229− ST-HSC = 0.0001 (0.00001–0.00016) day−1,
α_CD229− ST-HSC→CD229+ ST-HSC = 4 (2–4) day−1,
α_CD229+ ST-HSC→MPP = 0.03 (0.02–0.06) day−1,
α_CD229+ ST-HSC→CLP = 0.007 (0.004–0.012) day−1, β_CD229− ST-HSC = 4 (2–4)
day−1, β_CD229+ ST-HSC = 0.0 (−0.06–0.0016) day−1 and β_MPP = 4 (0.3–4)
day−1 (in brackets: 95% confidence intervals). For further details see
Supplementary Methods.

Extended Data Figure 8 | Limiting dilution analysis of in-situ-labelled
transplanted HSCs. a, Tie2MCM/+ RosaYFP mice were injected with tamoxifen
(donor bone marrow; see Fig. 5a). Numbers of YFP+ HSCs in donor bone
marrow transplanted into 38 recipient mice are displayed. Each dot is an
individual recipient. Mice grouped together are highlighted in black or white
(groups I–IV). b, Data underlying limiting dilution analysis. c, Numbers of
injected YFP+ donor HSCs are plotted against the fraction of YFP-negative
recipients on a logarithmic scale. d, Histogram showing the extent of self-renewing
proliferation for engrafted YFP+ HSCs in mice labelled ‘within
sampling error’ and ‘overrepresented’ in Fig. 5d (n = 18). Proliferation is shown
as the number of generations per engrafted HSC needed to achieve the
measured frequencies of YFP+ HSCs in the bone marrow 4 months after
transplantation. For further details see Supplementary Discussion.

Extended Data Figure 9 | Simulated effects of ablation of HSCs, ST-HSCs or
MPPs. In the mathematical model used to fit the label progression data
(Fig. 3c), numbers of HSCs (a), HSCs and ST-HSCs (b), and HSCs, ST-HSCs
and MPPs (c) were reduced from their steady-state values to zero at t = 0, and
subsequently held there. The predicted responses in the downstream
compartments are shown. The horizontal black lines at 1/e ≈ 37% indicate
the residence times of the compartments that control the time scale of the
response. Note that these simulations assume that there are no homeostatic
mechanisms present within the compartments that could maintain their size
independent of input from upstream compartments.

Extended Data Table 1 | Transplantations of in-situ-labelled HSC

Fraction               Number of     Number of   1° recipients          Number of   2° recipients
                       injected      1°          Long-term  Transient   2°          Long-term  Transient
                       donor cells   recipients                         recipients
LSK CD150+ CD48− YFP+  1             23          1 (4%)     4 (17%)     3           1 (33%)    2 (66%)
LSK CD150+ CD48− YFP+  5             1           1 (100%)   -           n.d.
LSK CD150+ CD48− YFP+  8             1           1 (100%)   -           n.d.
LSK YFP+               10            3           1 (33%)    2 (66%)     3           1 (33%)    2 (66%)
LSK YFP+               15            3           1 (33%)    2 (66%)     3           1 (33%)    2 (66%)
LSK YFP+               18            4           1 (33%)    -           3           1 (33%)    2 (66%)
LSK YFP+               20            3           1 (33%)    2 (66%)     5           2 (40%)    3 (60%)
LSK YFP+               40            3           3 (100%)   -           9           2 (22%)    7 (88%)

n.d., not done.
YFP-marked stem cells from tamoxifen-treated Tie2MCM/+ RosaYFP mice were injected into primary and secondary Rag2−/− γc−/− KitW/Wv recipient mice. Phenotypes and cell numbers of injected cells, numbers of
primary (1°) and secondary (2°) recipients, and reconstitution results are given. Percentages in brackets reflect the proportions of reconstituted mice.

Tissue-resident macrophages originate from
yolk-sac-derived erythro-myeloid progenitors


Elisa Gomez Perdiguero, Kay Klapproth, Christian Schulz, Katrin Busch, Emanuele Azzoni, Lucile Crozet, Hannah Garner,
Celine Trouillet, Marella F. de Bruijn, Frederic Geissmann & Hans-Reimer Rodewald


Most haematopoietic cells renew from adult haematopoietic stem cells 
(HSCs)'?, however, macrophages in adult tissues can self-maintain 
independently of HSCs*”’. Progenitors with macrophage potential 
in vitro have been described in the yolk sac before emergence of 
HSCs*”’, and fetal macrophages'*** can develop independently of 
Myb*, a transcription factor required for HSC", and can persist in 
adult tissues*’”"*. Nevertheless, the origin of adult macrophages and 
the qualitative and quantitative contributions of HSC and putative 
non-HSC-derived progenitors are still unclear’’. Here we show in mice 
that the vast majority of adult tissue-resident macrophages in liver 
(Kupffer cells), brain (microglia), epidermis (Langerhans cells) and 
lung (alveolar macrophages) originate from a Tie2+ (also known as
Tek) cellular pathway generating Csf1r+ erythro-myeloid progenitors
(EMPs) distinct from HSCs. EMPs develop in the yolk sac at embry- 
onic day (E) 8.5, migrate and colonize the nascent fetal liver before 
E10.5, and give rise to fetal erythrocytes, macrophages, granulocytes 
and monocytes until at least E16.5. Subsequently, HSC-derived cells 
replace erythrocytes, granulocytes and monocytes. Kupffer cells, mi- 
croglia and Langerhans cells are only marginally replaced in one- 
year-old mice, whereas alveolar macrophages may be progressively 
replaced in ageing mice. Our fate-mapping experiments identify, in 
the fetal liver, a sequence of yolk sac EMP-derived and HSC-derived 
haematopoiesis, and identify yolk sac EMPs as a common origin for 
tissue macrophages. 

Csf1r-expressing cells in the mouse embryo give rise to tissue-resident
macrophages in adult tissues. To identify in the developing embryo
the site of origin of Csf1r-expressing cells, we performed time course
analyses by constitutive (Csf1riCre) and inducible (Csf1rMeriCreMer) fate-mapping
of cells in the yolk sac, head, limbs, caudal region and fetal liver
(Fig. 1a and Extended Data Fig. 1). Progenitors, defined as Kit+ CD45lo
(ref. 12) (gate R1 in Fig. 1b), were first detected in Csf1riCre Rosa26YFP
embryos in the yolk sac from 16–18 somite pairs (sp) stage onwards
(E8.5, Fig. 1b, and Extended Data Fig. 1a–c). Csf1riCre YFP+ Kit− CD45+
cells (gate R2 in Fig. 1b), characterized in Fig. 2 as myeloid cells, were
detected in the yolk sac at 20–25 sp (E9, Fig. 1b), and subsequently in
the caudal and head regions of the embryo from E9.5, and the fetal liver
from E10.5 onwards (Extended Data Fig. 1a–d). To discriminate migration
of YFP+ cells from de novo labelling, we induced YFP expression
in Csf1rMeriCreMer Rosa26YFP embryos at E6.5 or E8.5. In embryos pulsed
at E6.5, YFP+ cells were not detected (Extended Data Fig. 2a, b). When
pulsed at E8.5, Csf1rMeriCreMer YFP+ Kit+ CD45lo progenitors were detected
between E9.5–11.5 in the yolk sac, and in the fetal liver from E10.5
(Fig. 1c, d). In the fetal liver, numbers of YFP+ Kit+ CD45lo progenitors
increased threefold from E10.5 to E11.5, at which time they were 25-fold
more numerous in the fetal liver than in the yolk sac (Fig. 1d). At
E8.5, all YS Csf1riCre YFP+ Kit+ CD45lo progenitors expressed AA4.1,
an antigen expressed on early haematopoietic progenitors (Extended
Data Fig. 1e). Csf1rMeriCreMer YFP+ AA4.1+ Kit+ CD45lo cells were also
present in the yolk sac from E9.5 to E10.5, and in the fetal liver from
E10.5 (Fig. 1e). These progenitors were undetectable at E10.5 in the
aorta-gonado-mesonephros (AGM) region (Fig. 1e), indicating they do not
originate within the embryo proper.

Together, these fate-mapping experiments demonstrate that yolk-sac-derived
progenitors colonize the liver primordium, as proposed earlier,
and their expression of AA4.1 suggests that they represent
erythro-myeloid progenitors (EMPs). In in vitro colony-forming assays,
the AA4.1+ population contained most of the total E9 yolk sac
colony-forming-units-culture (CFU-C 266 ± 137 vs 296 ± 75, mean ± standard
deviation (s.d.)). Frequencies and distributions of different CFU-C,
that is, erythroid (E)/megakaryocyte (Mk) (E/Mk), granulocyte/macrophage
(G/M), and G, M, E and/or Mk (Mix) potential, were comparable
between overall AA4.1+ and Csf1riCre YFP+ AA4.1+ progenitors (Fig. 2a,
Extended Data Fig. 3). Moreover, in the E12.5 fetal liver, the CFU potential
of overall AA4.1+ and Csf1rMeriCreMer YFP+ AA4.1+ cells was
comparable to the yolk sac progenitors (Fig. 2a).

These results indicated that yolk-sac-derived, E8.5-labelled YFP+
AA4.1+ Kit+ CD45lo progenitors have erythroid and myeloid potential
in yolk sac and fetal liver. Next, we investigated by fate-mapping their
contribution to fetal liver haematopoiesis in vivo. Csf1riCre YFP+ and
Csf1rMeriCreMer YFP+ F4/80bright fetal macrophages were first detected
among Kit− CD45+ (R2 in Fig. 1b) at E10.5 in the yolk sac, liver, head
and forelimbs (Fig. 2b, Extended Data Fig. 4a–c). In addition, the fetal
liver from E12.5 to E16.5 contained Csf1rMeriCreMer YFP+ monocytes
and granulocytes (Fig. 2c). The fetal liver also contained Csf1rMeriCreMer
YFP+ red blood cells from E11.5 until at least E14.5 (Fig. 2d, Extended
Data Fig. 4d). Red blood cells were not labelled before E11.5, indicating
that, in contrast to yolk-sac-derived erythrocytes in the fetal liver, primitive
erythrocytes in the yolk sac did not arise from Csf1r-expressing
cells. Collectively, yolk-sac-derived Csf1r+ progenitors contribute to
fetal liver haematopoiesis by giving rise to F4/80bright macrophages,
monocytes, granulocytes and red blood cells.

We next investigated the transition from yolk-sac-derived to HSC- 
derived haematopoiesis. To trace the latter, we used FIt3©" which labels 
fetal and adult HSC-derived multipotent haematopoietic progenitors”, 
and their progeny (Extended Data Fig. 5). We compared progeny of 
yolk-sac-derived progenitors in Csf1r“"-""" mice to progeny of HSCs 
in Flt3C” mice. In the fetal liver from E14.5 to E18.5, the progenies of 
Csfir* and Fit3* precursors were distinct but complemented each other 
(Fig. 3a). AtE14.5, yolk-sac-derived cD45* populations included Kit* 
progenitors, F4/80°""* macrophages, and CD11b"Gr1* monocytes/ 
granulocytes (Fig. 3a, Extended Data Fig. 6a). Of note, monocytes/ 
granulocytes were present in Myb-deficient fetal liver (Fig. 3a). 
Csr" YEP* macrophages remained detectable throughout fetal 
development, and were not replaced by FIt3“° YFP* cells. However, 
yolk-sac-derived Kit* cells and myeloid cells were no longer detectable 
by E16.5 and E18.5, respectively (Fig. 3a). In contrast, Flt3“"° YFP* Kit 
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Figure 1 | E8.5 Csfir* progenitors originate in the yolk sac and expand in 
the fetal liver. a, Fate-mapping analysis of Csflr-expressing cells. Arrows 
indicate time points for analysis, and green shades the genetic labelling period. 
b, YFP expression on live cells from Csf1 r’Rosa26"? yolk sac (YS), separated 
by somite pairs between E8.25 and E9 (0 sp, n = 3; 3-6 sp, n = 3; 7-9 sp, 

n= 5; 11-13 sp, n = 3; 16-18 sp, n = 4; 20-25 sp, n = 4) (upper panels), 

and Kit and CD45 expression on YFP” cells (lower panels). R1 indicates 
Kit*CD45"°, and R2 indicates Kit” CD45* cells. c, Schematic representation of 
sites analysed in mouse embryos: YS, AGM region, fetal liver and head. 

Kit and CD45 phenotype of YFP* cells from Csf1r“°"'0"™"" Rosa26""? 
embryos pulsed with OH-TAM at E8.5 (E9.5, n = 3; E10.25, n = 3; E10.5, 


n= 4; E11.25, n = 4; E12.5, n = 9). d, Number of YEP* Kit* CD45" cells (R1 
cells, and CD11bhiGr1+ granulocytes/monocytes increased in numbers
between E14.5 and E18.5 (Fig. 3a-c). The progenies of Csf1r+ progenitors
and Flt3+ progenitors were also distinct during development in the
lung and skin (Extended Data Fig. 6b, c). Quantitative analyses in fetal
and adult tissues indicated that Flt3Cre YFP labelling of Kit+ progenitors
preceded that of monocytes/granulocytes, with 80% of progenitors
labelled at E18.5, and 80% of monocytes at postnatal day 8 (P8)
(Fig. 3b). In contrast, Flt3Cre YFP labelling plateaued at 14% for adult
liver F4/80bright Kupffer cells, at 2% for CD45lo brain microglia, and at 30%
for epidermal Langerhans cells up to one year of life (Fig. 3b). In contrast,
Flt3Cre YFP labelling of CD45+ F4/80+ brain macrophages and
lung alveolar macrophages was 16% in 12-week-old adults but increased
progressively over time to reach 40% in one-year-old mice (Fig. 3b).
in panel b) per organ or region (mean ± s.e.m.) in Csf1rMeriCreMerRosa26YFP
embryos pulsed at E8.5 (Source Data Table for Fig. 1). e, Number of
YFP+AA4.1+Kit+CD45lo cells per embryonic region and time points
(mean ± s.e.m.) in Csf1rMeriCreMerRosa26YFP embryos pulsed at E8.5 (Source
Data Table for Fig. 1). See also Supplementary Table 1 and Extended Data
Figs 1 and 2.


Altogether, these data indicate that Flt3Cre YFP+ Kit+ progenitors and
monocytes account for only minor fractions of microglia, Kupffer cells,
alveolar macrophages and Langerhans cells in young adults. To investigate
whether the presence of these adult Flt3Cre YFP+ F4/80bright macrophages
corresponds to their HSC origin, we performed non-myeloablative
transplantations of YFP+ long-term-HSCs (LT-HSCs) from adult wild-type
bone marrow into Rag2−/−γc−/−KitW/Wv recipients (ref. 23) (Extended Data
Fig. 7). Eight weeks after transplantation, the vast majority of HSCs,
myeloid progenitors, monocytes and F4/80lo tissue myeloid cells in the
recipients were of donor HSC origin. In contrast, only 7% of F4/80bright
macrophages in spleen, 2% in liver, 5% in lung, 13% in pancreas, 2% in
epidermis and 0% in the brain were donor-derived. Thus, recruitment
of HSC-derived precursors is not a major mechanism for the maintenance
of F4/80bright macrophages in these tissues.

Collectively, these findings reveal that the transition from yolk-sac- to
HSC-derived haematopoiesis occurs late in fetal development for monocytes
(E14.5) and granulocytes (E16.5), and suggest that HSC-derived
progenitors only marginally replace yolk-sac-derived microglia in the
brain, Kupffer cells in the liver and Langerhans cells in the epidermis,
although alveolar macrophages and brain CD45+ F4/80+ macrophages
may undergo progressive replacement with age.

Labelling efficiency of most tissue-resident macrophage populations
in adult Csf1rMeriCreMerRosa26YFP mice pulse-labelled with
4-hydroxytamoxifen (OH-TAM) at E8.5 was low***. The strength of most
genetic pulse-labelling systems is that they allow fate-mapping of cells
during a specific time window; a weakness, however, is that labelling is
commonly incomplete, which could explain why a large fraction of
tissue-resident macrophages remained unlabelled. Hence, based on these data
we cannot formally exclude a fetal HSC origin of the unlabelled cells, as
suggested by others based on transfer of fetal precursors**”°.

We thus made use of a newly generated inducible Cre knock-in mouse
(Tie2MeriCreMer) to track haematopoietic output from haematopoietic
progenitors and HSCs in situ (Busch K. et al., submitted). Tie2 (also
known as Tek) is expressed in endothelial cells, yolk sac progenitors,
aorta-gonado-mesonephros region, fetal liver and adult HSCs’. We assessed
the time window at which Tie2+ cells contributed to emerging
HSCs and macrophages by injecting tamoxifen at different time points
(Fig. 4a, Extended Data Figs 8-10). Fetal liver E12.5 and E15.5 LT-HSCs
were labelled efficiently in Tie2MeriCreMer embryos pulsed at E6.5, E7.5
or E10.5 (Fig. 4b, Extended Data Figs 8 and 9). Yolk sac E9.5 Kit+CD45lo
progenitors were also labelled in Tie2MeriCreMerRosa26YFP embryos
pulsed at E7.5 (Extended Data Fig. 10). Interestingly, fetal liver cells
with a megakaryocyte-erythrocyte progenitor (MEP) phenotype, and
F4/80bright macrophages in yolk sac, brain, and fetal liver were labelled
with high efficiency (60%) in embryos pulsed at E6.5 and E7.5, but not
in embryos pulsed at E10.5 (Fig. 4b, Extended Data Fig. 8). These
fate-mapping experiments directly demonstrate that E12.5 and E15.5 fetal
macrophages originate from cells that express Tie2 as early as E6.5
Figure 2 | E8.5 Csf1r+ progenitors differentiate into myeloid cells and red
blood cells in the fetal liver. a, Distribution of mixed (Mix), G/M, and E/Mk
CFU-C from unsorted, AA4.1+Kit+CD45lo and YFP+AA4.1+Kit+CD45lo
cells from E9 yolk sac from Csf1riCreRosa26YFP and E12.5 fetal liver from
Csf1rMeriCreMerRosa26YFP embryos pulsed with OH-TAM at E8.5 (three
independent experiments each). CFU-C erythroid and/or megakaryocyte
(E/Mk); CFU-granulocyte and/or monocyte/macrophage (G/M); CFU-C mix,
at least three of the following: G, E, M and Mk. See also Extended Data Fig. 3.
b, F4/80 and CD11b expression on YFP+CD45+ cells from yolk sac, head (brain
for E11.5), limbs and liver of Csf1rMeriCreMerRosa26YFP embryos pulsed with
OH-TAM at E8.5, and analysed on E9.5 (n = 3), E10.5 (n = 4), E11.5 (n = 4),
and E12.5 (n = 9). Dashed lines represent FMO (fluorescence minus one)


and, importantly, before E9.5, and strongly support the notion that fetal 
liver erythro-myeloid progenitors, and all fetal tissue macrophages up 
to E15.5 are of yolk sac origin. 


control. See also Extended Data Fig. 4. c, F4/80, CD11b, Gr1, Ly-6G
and Ly-6C expression in fetal liver CD45+YFP+ cells from
Csf1rMeriCreMerRosa26YFP embryos pulse-labelled at E8.5, and analysed on
E12.5 (n = 9) and E16.5 (n = 14). May-Grünwald-Giemsa stained cytospin
preparations of fetal liver YFP+F4/80bright and YFP+CD11bhi cells sorted
from E16.5 Csf1rMeriCreMerRosa26YFP embryos pulsed with OH-TAM at
E8.5. See also Extended Data Fig. 6a for sorted cells from E14.5 Csf1rMeriCreMer
embryos. Scale bar, 10 µm. d, YFP labelling efficiency (%) among red blood cells
in fetal liver from Csf1rMeriCreMerRosa26YFP embryos pulsed with OH-TAM
at E8.5, mean ± s.e.m. (E11.5, n = 4; E12.5, n = 9; E14.5, n = 5; E16.5,
n = 11; E18.5, n = 4; see Source Data Table Fig. 3).


In adult mice pulsed at embryonic stages (E7.5, or E8.5, or E9.5 or
E10.5), bone marrow HSC-derived progenitors, peripheral cells (T and
B cells, and granulocytes) in the spleen, and CD11bhiF4/80lo myeloid cells


[Figure 3 image. a, Flow cytometry plots of fetal liver at E14.5, E16.5 and E18.5, including Myb−/− embryos. b, YFP labelling efficiency (%) over time in liver, brain, lung and skin.]

Figure 3 | Fetal liver HSC-derived Flt3+ progenitors give rise to monocytes
and granulocytes in late embryos and adults but do not replace yolk-sac-
derived macrophages. a, F4/80, Kit, CD11b and Gr1 expression on total
CD45+ cells (black) and YFP+CD45+ cells from Csf1rMeriCreMerRosa26YFP
embryos pulsed at E8.5 (green) in the fetal liver at the indicated days of
embryonic development (E14.5, n = 5; E16.5, n = 10; E18.5, n = 9). F4/80, Kit,
CD11b and Gr1 expression on YFP+CD45+ cells from Flt3CreRosa26YFP
embryos (orange) (E14.5, n = 7; E16.5, n = 6; E18.5, n = 6). F4/80 and
CD11b expression on CD45+ cells in Myb−/− embryos (E14.5, n = 4; E16.5,
n = 7). b, YFP labelling efficiency in Kit+lin− cells, CD11bhiF4/80lo cells
(see Extended Data Fig. 5) and F4/80bright macrophages (Kupffer
cells in adults) in fetal and adult Flt3CreRosa26YFP liver (first panel on the left).
YFP labelling efficiency in blood monocytes, brain microglia (CD45loF4/80+)
and CD45+F4/80+ brain macrophages in Flt3CreRosa26YFP pups and mice
(second panel). YFP labelling efficiency in alveolar macrophages (F4/80bright
Siglec-F+CD11b−) and F4/80loCD11bhi myeloid cells in Flt3CreRosa26YFP
lungs (third panel). YFP labelling efficiency in epidermal Langerhans cells
(LCs) and dermal CD11bhi (MHC II+EpCAM−) myeloid cells in
Flt3CreRosa26YFP skin (fourth panel, see Extended Data Fig. 6b, c).
Mean ± s.e.m.; P8, n = 3; 4-week-old, n = 6; 12-week-old, n = 11-14;
40-week-old, n = 7; 1-year-old, n = 3, see Source Data table Fig. 3. w, week;
y, year. c, Representative images of May-Grünwald-Giemsa stained cytospin
preparations of YFP+CD11bhiF4/80lo cells sorted from E18.5 Flt3Cre
Rosa26YFP fetal liver. Scale bar, 10 µm.
Figure 4 | Fetal macrophages and adult tissue-resident macrophages
originate from Tie2-expressing progenitors before E10.5. a, Fate-mapping
analysis of Tie2-expressing cells after tamoxifen (TAM) administration at
E7.5, or E8.5, or E9.5 or E10.5. Arrows indicate time points for analysis. b, Flow
cytometric analysis of fetal liver long-term or short-term haematopoietic
stem cells (LT-HSCs, ST-HSCs), multipotent progenitors (MPPs), common
myeloid progenitors (CMPs), granulocyte-monocyte progenitors (GMPs),
megakaryocyte-erythrocyte progenitors (MEPs) (left panel) and of fetal
macrophages (right panel) in the yolk sac, brain, and fetal liver. Time points
of labelling (E7.5 (n = 7); E10.5 (n = 7)) and analysis are indicated, and for each
experiment one representative analysis is shown. See Extended Data Fig. 8
for quantitative analysis. c, Frequencies of labelled HSCs and progenitor
cells, splenocytes, F4/80loCD11bhi myeloid cells and F4/80bright resident
macrophages in spleen, liver, lung, epidermis and brain were analysed
(mean ± s.d., see Source Data Table Fig. 4) from 6-8-week-old Tie2MeriCreMer
animals pulse-labelled at E7.5 (n = 4), E8.5 (n = 4), E9.5 (n = 4) or
E10.5 (n = 6).


in peripheral tissues (spleen, liver and lung) were homogeneously labelled
at frequencies comparable to HSC labelling, consistent with their
adult HSC origin (Fig. 4c). In contrast, YFP labelling frequencies of adult
tissue-resident macrophages were maximal in animals pulse-labelled at
E7.5, declined at later time points and were minimal when labelled at
E10.5 (Fig. 4c). The fact that adult HSCs are disconnected from resident
macrophages is further underscored by the finding that resident
macrophages in mice pulsed at E7.5 were labelled at higher frequencies
than adult HSCs, that is, labelling efficiency did not equilibrate with mouse
development. In summary, these inducible temporal analyses demonstrate
that although both macrophages and HSCs originate from progenitors
expressing Tie2 as early as E6.5, adult tissue-resident macrophages
in the brain (microglia), liver (Kupffer cells), lung (alveolar macrophages),
skin (Langerhans cells), and (to some extent) spleen (F4/80bright
macrophages) develop almost exclusively from a Tie2-expressing progenitor
pathway distinct from HSCs. These data are consistent with results
from Csf1rMeriCreMer pulse-labelling experiments (see Figs 1 and 2), with
our earlier observation that resident macrophages are independent of
the transcription factor Myb*, and complement our data obtained in
Flt3CreRosa26YFP mice (see Fig. 3).

This study demonstrates that Myb-independent tissue-resident
macrophages* originate from yolk-sac-derived EMPs, characterized by
expression of Csf1r from E8.5 (16-18 somites). The data do not distinguish
whether resident macrophages originate from erythro-myeloid,
granulocyte-macrophage, or macrophage-only progenitors because these
potentials coexist within the yolk-sac-derived EMP population.

We also provide strong in vivo evidence for engraftment of yolk-sac-
derived EMPs in the early fetal liver. These cells substantially contribute
to the first wave of fetal liver haematopoiesis, followed later by bona
fide fetal liver HSC-derived haematopoiesis*'*’. Conclusions from recent
studies that Langerhans cells and alveolar macrophages are not of
yolk sac origin based on transfer of fetal precursors***° should be
interpreted in light of our findings that yolk sac EMPs expand in the fetal
liver and are the main source for tissue-resident macrophages.

Under steady-state conditions, yolk-sac-derived macrophages are only 
marginally replaced by HSC-derived cells in the brain, liver and epi- 
dermis. It is remarkable that macrophages of yolk sac origin persist in 
functionally very distinct tissues, suggesting that the origin is more de- 
terministic of the life span than the tissue location. However, some yolk- 
sac-derived macrophages can undergo replacement in older mice, as 
for lung alveolar macrophages. In a third group, exemplified by gut- 
associated macrophages”, yolk-sac-derived macrophages are replaced 
by HSC-derived macrophages in the first weeks of post-natal life. The 
mechanisms responsible for the maintenance of yolk-sac-derived mac- 
rophages in certain adult tissues require further investigation. Although 
yolk-sac- and HSC-derived macrophages can co-exist in the same envi- 
ronment, and their balance be perturbed by pathology, the contributions 
of these developmentally distinct macrophage populations to home- 
ostasis and inflammation remain to be characterized. 


Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper. 
METHODS

Animals. Myb−/− (ref. 28), Csf1rMeriCreMer (ref. 29), Csf1riCre (ref. 30), Flt3Cre (ref. 31),
Rag2−/−γc−/−KitW/Wv (ref. 23) and Rosa26YFP reporter” mice have been
previously described. Rag2−/−γc−/−KitW/Wv mice were on a mixed genetic background”,
Csf1rMeriCreMer and Csf1riCre mice were on FVB background, other mice were on
C57BL/6 (CD45.2) background.

Csf1rMeriCreMer and Csf1riCre mice were generated and provided by J. W. Pollard.
Myb−/− mice were generated and provided by J. Frampton. Flt3Cre mice were
generated by C. Bleul and provided by T. Boehm and S. E. Jacobsen. Rosa26YFP
(B6.129X1-Gt(ROSA)26Sortm1(EYFP)Cos/J) reporter mice were purchased from The
Jackson Laboratory. To achieve tamoxifen-dependent Cre activity in Tie2-expressing
cells in vivo we generated Tie2MeriCreMer mice. We inserted a tamoxifen-inducible
codon-improved recombinase (iCre)*? flanked by two mutated oestrogen receptor
sites (MeriCreMer)* into the first exon of the Tie2 locus. Tie2MeriCreMer mice were
crossed to Rosa26YFP mice to fate-map the progeny of Tie2-expressing cells.

No randomization method was used and the investigators were blinded to the 
genotype of the embryos and animals during the experimental procedure. Results 
are displayed as mean ± s.e.m. (Figs 1, 2 and 3) or s.d. (Fig. 4, Supplementary Table 1).
All experiments included littermate controls and the minimum sample size used 
was 3. Embryonic development was estimated considering the day of vaginal plug 
formation as 0.5 days post-coitum (dpc), and staged by developmental criteria®. In 
Figs 1 and 2 and Extended Data Figs 1-4, embryos were included based on their 
somite number for embryonic days < E11.5, as described in ref. 8. No statistical 
method was used to predetermine sample size. 
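
The summary statistics named in this paragraph can be sketched in a few lines. This is a minimal illustration rather than the analysis code used in the study: the function names `mean_sem` and `mean_sd` are hypothetical, and the sketch assumes that the s.e.m. is the sample standard deviation divided by the square root of n.

```python
import math
import statistics


def mean_sd(values):
    """Return (mean, sample standard deviation) for a list of measurements."""
    if len(values) < 2:
        raise ValueError("at least two values are needed for a standard deviation")
    return statistics.fmean(values), statistics.stdev(values)


def mean_sem(values):
    """Return (mean, s.e.m.), with s.e.m. = sample s.d. / sqrt(n) (assumed definition)."""
    mean, sd = mean_sd(values)
    return mean, sd / math.sqrt(len(values))


# Hypothetical labelling efficiencies (%) from three embryos, for illustration only.
example = [2.0, 4.0, 6.0]
m, sem = mean_sem(example)
print(f"mean = {m:.2f}, s.e.m. = {sem:.2f}")
```

With the minimum sample size of 3 stated above, the s.e.m. is the s.d. divided by roughly 1.73, so the two error measures reported in Figs 1-3 and in Fig. 4 are not directly comparable.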

All animal procedures were performed in adherence to our project licence issued
by the United Kingdom Home Office under the Animals (Scientific Procedures) Act
1986, or by the German regional council at the Regierungspräsidium Karlsruhe,
Germany, respectively.

Genotyping. PCR genotyping of Myb−/−, Csf1riCre (ref. 30), Rag2−/−γc−/−KitW/Wv
(ref. 23), Csf1rMeriCreMer and Flt3Cre mice’ was performed according to protocols
described previously. PCR genotyping of Tie2MeriCreMer will be described elsewhere
(Busch K. et al., submitted).

Processing of tissues for flow cytometry. Pregnant females were killed by cervical
dislocation or by exposure to CO2. Embryos ranging from embryonic day (E) 8.25
to E18.5 were removed from the uterus and washed in 4 °C phosphate-buffered
saline (PBS, Invitrogen). The yolk sac (YS) was harvested from embryos between
E8.25 and E12.5. Embryos were exsanguinated through decapitation in 1× PBS with
10 mM EDTA. To obtain single-cell suspensions, organs were incubated in PBS
containing 1 mg ml−1 collagenase D (Roche), 100 U ml−1 DNase I (Sigma) and 3% fetal
calf serum (FCS, Invitrogen) at 37 °C for 30 min.

Adult tissues (P8 to 1 year) were prepared as follows. Blood was collected by
cardiac puncture from anaesthetized (isoflurane inhalation) mice. Under terminal
anaesthesia, mice were perfused by gentle intracardiac injection of 10 ml prewarmed
(37 °C) 1× PBS. The spleen, right liver lobe and right lung lobes were collected and
processed for flow cytometry. To obtain single-cell suspensions, organs were
incubated for 30 min in PBS containing 1 mg ml−1 collagenase D (Roche), 100 U ml−1
DNase I (Sigma), 2.4 mg ml−1 of dispase (Invitrogen) and 3% FCS (Invitrogen) at
37 °C. Brains from Tie2MeriCreMer mice were dissociated and incubated for 60 min at
37 °C in HBSS with 0.2 mg ml−1 collagenase D, 20 µg ml−1 dispase I (Roche), and
50 U ml−1 DNase I (Sigma). Brain cells were resuspended in isotonic Percoll
(Pharmacia) at a final density of 1.072 g ml−1 in HBSS containing 3% FCS. The
suspension was underlayered with Percoll solution at 1.088 g ml−1 and overlayered with
additional layers of Percoll (1.06, 1.05 and 1.03 g ml−1). After centrifugation, cells
were collected from the 1.06 and 1.072 g ml−1 layers. Brains from the other strains
were processed as described for the spleen. For collection of Langerhans cells from
Tie2MeriCreMer mice, epidermal sheets were prepared using an epidermis
dissociation kit (Miltenyi Biotec). In the other strains, epidermal sheets were separated
from the dermis after incubation for 45 min at 37 °C in 2.4 mg ml−1 of dispase
(Invitrogen) and 3% FCS (Invitrogen) and the epidermis was further digested for 30 min
in PBS containing 1 mg ml−1 collagenase D (Roche), 100 U ml−1 DNase I (Sigma),
2.4 mg ml−1 of dispase (Invitrogen) and 3% FCS (Invitrogen) at 37 °C.
Flow cytometric analysis of embryonic and adult tissues and cell sorting. Tissues
were mechanically dissociated and passed through a 100 µm cell strainer (BD). Red
blood cell lysis of fetal liver and adult lung and spleen was performed as described”.
Cells were centrifuged at 320g for 7 min, resuspended in 4 °C PBS, plated in
multi-well round-bottom plates and immunolabelled for FACS analysis. After 15 min
incubation with purified anti-CD16/32 (FcγRIII/II) diluted 1/50, or ChromPure
mouse IgG whole molecule (Dianova) diluted 1/20 in staining buffer (1× PBS; 0.5%
BSA; 2 mM EDTA), antibody mixes were added and incubated for 30 min. Where
appropriate, cells were further incubated with streptavidin conjugates for 20 min.
The full list of antibodies used can be found in Supplementary Table 2.


Flow cytometry was performed using a BD Biosciences FACSCanto II flow cytom- 
eter or a BD Biosciences LSR Fortessa cell analyser. All data were analysed using 
FlowJo 9.5 (Tree Star) or FACS Diva software (BD Bioscience). 

Fetal liver, skin and lung YFP+F4/80bright and YFP+CD11bhi cells from E18.5
Flt3CreRosa26YFP embryos and from E14.5 and E16.5 Csf1rMeriCreMerRosa26YFP
embryos pulsed at E8.5 were sorted into FCS-coated tubes using FACSAria II for
cytospin preparations.

Pulse labelling of Csf1r+ and Tie2+ progenitors. For genetic cell labelling we
crossed tamoxifen-inducible Csf1rMeriCreMer and Tie2MeriCreMer transgenic mouse
strains with Rosa26YFP reporter mice. In Csf1rMeriCreMerRosa26YFP embryos
recombination was induced by single injection at E8.5 of 75 µg per g (body weight) of
4-hydroxytamoxifen (Sigma) into pregnant females. The 4-hydroxytamoxifen was
supplemented with 37.5 µg per g (body weight) progesterone (Sigma). In Tie2MeriCreMer
Rosa26YFP embryos recombination was induced by treatment of pregnant females
by gavage at different time points (between E7.5 and E10.5) with a single dose of
2.5 mg tamoxifen (Sigma) and 1.75 mg progesterone (Sigma) to counteract the mixed
oestrogen agonist effects of tamoxifen, which can result in fetal abortions.
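
The weight-based dosing for the Csf1rMeriCreMer pulse can be worked through as a short calculation. The sketch below is illustrative only: the function name `oh_tam_injection_ug` and the 30 g example weight are hypothetical, while the two per-gram doses (75 µg and 37.5 µg) are the values stated above.

```python
OH_TAM_UG_PER_G = 75.0        # 4-hydroxytamoxifen, µg per g body weight (from the text)
PROGESTERONE_UG_PER_G = 37.5  # progesterone, µg per g body weight (from the text)


def oh_tam_injection_ug(body_weight_g):
    """Return the total µg of each compound for one pregnant female.

    body_weight_g is the female's body weight in grams; the name and
    return layout are assumptions made for this illustration.
    """
    if body_weight_g <= 0:
        raise ValueError("body weight must be positive")
    return {
        "4-hydroxytamoxifen_ug": OH_TAM_UG_PER_G * body_weight_g,
        "progesterone_ug": PROGESTERONE_UG_PER_G * body_weight_g,
    }


# Hypothetical 30 g female: 2250 µg OH-TAM and 1125 µg progesterone.
print(oh_tam_injection_ug(30.0))
```

The Tie2MeriCreMer protocol differs in that it uses fixed doses per animal (2.5 mg tamoxifen and 1.75 mg progesterone by gavage), so no weight scaling applies there.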
Continuous labelling of Csf1r+ progenitors. For fate-mapping analysis of Csf1r+
precursors, Csf1riCre females were crossed with homozygous Rosa26YFP reporter
males. Indicated tissues from embryos and adult F1 mice were analysed by flow
cytometry.

Fate-mapping of Flt3+ haematopoietic progenitors. For fate-mapping analysis
of Flt3+ precursors, Flt3Cre males (the transgene is located on the Y chromosome)
were crossed to homozygous Rosa26YFP reporter females. For adult experiments,
Flt3Cre males were blood phenotyped. Animals with YFP labelling efficiency above
60% in the lymphocytes, monocytes and granulocytes were used for experiments
and female littermates were used as Cre-negative controls.
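
The inclusion rule for adult Flt3Cre males can be expressed as a simple check. This is a hedged sketch, not the authors' procedure: the function names are hypothetical, labelling efficiency is assumed to be the percentage of YFP+ cells within a gated population, and the rule is read as requiring each of the three blood lineages to exceed 60% individually.

```python
REQUIRED_LINEAGES = ("lymphocytes", "monocytes", "granulocytes")
THRESHOLD_PERCENT = 60.0  # threshold stated in the text


def labelling_efficiency(yfp_positive, total):
    """Percentage of YFP+ cells among all cells of a gated population."""
    if total <= 0:
        raise ValueError("total cell count must be positive")
    return 100.0 * yfp_positive / total


def passes_blood_phenotyping(efficiency_by_lineage):
    """True if every required lineage is labelled above the threshold.

    Assumes the criterion applies to each lineage separately.
    """
    return all(
        efficiency_by_lineage[lineage] > THRESHOLD_PERCENT
        for lineage in REQUIRED_LINEAGES
    )


# Hypothetical cell counts for one animal, for illustration only.
animal = {
    "lymphocytes": labelling_efficiency(850, 1000),
    "monocytes": labelling_efficiency(700, 1000),
    "granulocytes": labelling_efficiency(650, 1000),
}
print(passes_blood_phenotyping(animal))
```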

Colony forming assays. Colony-forming-unit-culture (CFU-C) assays were
performed using Methocult M3434 (Stem Cell Technologies) as described in ref. 36.
Embryos were collected and dissected in PBS (Gibco, Invitrogen) supplemented
with 10% FCS (batch tested and obtained from Gibco), 50 U ml−1 penicillin, and
50 µg ml−1 streptomycin (Cambrex Corporation). E9 embryos were staged by
somite counting. E9 yolk sac and E12 fetal livers were each pooled and incubated for
30 min at 37 °C in PBS supplemented with 10% FCS, 50 U ml−1 penicillin, 50 µg ml−1
streptomycin, 1 mg ml−1 collagenase D (Roche) and 100 U ml−1 DNase (Sigma),
and dissociated by pipetting. Suspensions were washed, and viable cells were counted
on the basis of trypan blue (Sigma) exclusion using a Kova hemocytometer slide.

AA4.1+ progenitors were isolated by flow cytometry using FACSAria II
or FACSAria III. Labelling of cells was performed as described above using the
following antibodies: CD45-APC-Cy7, Kit-PE and AA4.1-APC (Supplementary
Table 2), and live cells were gated on the basis of Hoechst 33258 exclusion. Cells
were collected into FCS-coated tubes and recounted before plating where possible.
Gates were defined using unstained, single stained and fluorescence minus one (FMO)
stained cells.

Cells were plated in duplicate in 35-mm culture dishes according to the
manufacturer's instructions. Cultures were grown at 37 °C with 5% CO2, with colonies scored
after 10 days.
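
Scoring duplicate dishes leads naturally to a per-sample colony frequency. The sketch below is an assumption-laden illustration: the function name `cfu_c_frequency` is hypothetical, and it assumes that frequency is expressed as colonies per cell plated, averaged over the duplicate dishes; the source does not state the exact formula.

```python
def cfu_c_frequency(colony_counts, cells_plated_per_dish):
    """Mean colonies per plated cell across replicate dishes.

    colony_counts: colonies scored on each replicate dish (for example, duplicates).
    cells_plated_per_dish: number of cells seeded on each dish (assumed equal).
    """
    if not colony_counts:
        raise ValueError("at least one dish is required")
    if cells_plated_per_dish <= 0:
        raise ValueError("cells plated must be positive")
    total_colonies = sum(colony_counts)
    total_cells = cells_plated_per_dish * len(colony_counts)
    return total_colonies / total_cells


# Hypothetical duplicate dishes seeded with 200 cells each, for illustration only.
freq = cfu_c_frequency([12, 8], 200)
print(f"{freq:.3f} colonies per cell, i.e. 1 in {1 / freq:.0f} cells")
```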

Colonies were picked and washed once with phosphate-buffered saline (PBS;
Gibco, Invitrogen) supplemented with 10% fetal calf serum (FCS; batch tested and
obtained from Gibco). Cytospin preparations were stained with the May-Grünwald-
Giemsa method for morphological inspection of colonies (see below).
Morphological analysis of sorted cells and colonies. Cytospin preparations were
performed using a Cytospin 3 (Thermo Shandon) by centrifuging (i) cells from
colonies at 400 r.p.m. for 4 min (medium acceleration) or (ii) sorted cells at 500 r.p.m.
for 10 min (low acceleration). Slides were air-dried for at least 30 min, and fixed for
5 min in methanol. Methanol-fixed cytospin preparations were manually stained
in 50% May-Grünwald solution for 5 min, 14% Giemsa for 15 min, washed with
Sorensen's buffered distilled water (pH 6.8) for 5 min and rinsed with Sorensen's
buffered distilled water (pH 6.8). After air-drying, slides were mounted with Entellan
New (Merck) and representative pictures were taken using a Nikon eclipse E6000
microscope with a Nikon Plan Fluor 60×/1.40 NA oil DIC H objective and NIS-
elements BR2.30 software (Nikon).
Transplantation of HSCs without irradiation. HSC transplantation in non-
irradiated Rag2−/−γc−/−KitW/Wv mice was performed as described previously’. In
brief, approximately 1000 LT-HSCs (lin−Sca-1+Kit+CD150+CD48−) isolated from
the bone marrow of panRosaYFP mice, which carry a constitutively active YFP reporter
allele, were injected into Rag2−/−γc−/−KitW/Wv mice. Recipients were analysed
2 months after transplantation for donor/host chimaerism in blood, spleen, lung,
liver, pancreas, brain and epidermis. To test the functionality of E12.5 phenotypic
LT-HSCs, 10 YFP+ LSK CD150+CD48− (phenotypic LT-HSCs) from Tie2MeriCreMer
Rosa26YFP embryos pulsed at E7.5 were transplanted into Rag2−/−γc−/−KitW/Wv mice and
blood lineages were analysed 16 weeks later.
Extended Data Figure 1 | Analysis of Csf1r reporter expression in fetal
progenitor cells in Csf1riCreRosa26YFP. a, Schematic representation of the
different haematopoietic and non-haematopoietic sites dissected in the mouse
embryos: yolk sac (YS), aorta-gonado-mesonephros (AGM) region, fetal liver
and head. b, Experimental design for fate-mapping analysis of Csf1r-expressing
cells. Arrows indicate analysed time points. c, Kit and CD45 expression on
YFP+ cells from Csf1riCreRosa26YFP embryos (E8.25, n = 7; E8.5, n = 4; E9.25-
E9.5, n = 16; E10.25, n = 9; E10.5, n = 5; E11.5, n = 8; E12.5, n = 5). d, Number
of YFP+Kit+CD45lo cells per organ/region and developmental time points
(mean ± s.e.m.) in Csf1riCreRosa26YFP embryos (upper panel). Number of
YFP+AA4.1+Kit+CD45lo cells per embryonic region and developmental time
points (mean ± s.e.m.) in Csf1riCreRosa26YFP embryos (lower panel). e, AA4.1
and Kit expression on YFP+ cells from Csf1riCreRosa26YFP embryos (upper
panel) and from Csf1rMeriCreMerRosa26YFP embryos pulsed with OH-TAM at
E8.5 (lower panel).
Extended Data Figure 2 | Fate-mapping analysis of Csf1r-expressing cells.
a, Experimental design for fate-mapping analysis of Csf1r-expressing cells.
Arrows indicate analysed time points. b, YFP expression on live cells from
Csf1rMeriCreMerRosa26YFP embryos pulsed at E6.5 with OH-TAM and analysed
at E10.5 (n = 2) and E12.5 (n = 4). c, Percentage of YFP+ cells among
Kit+CD45lo cells (YFP labelling efficiency) per organ/region (mean ± s.e.m.).
Upper panel, Csf1riCreRosa26YFP embryos (E8.25, n = 7; E8.5, n = 4; E9.5,
n = 16; E10.25, n = 9; E10.5, n = 5; E11.5, n = 8; E12.5, n = 5); lower panel,
Csf1rMeriCreMerRosa26YFP embryos pulsed at E8.5 (E9.5, n = 3; E10.25,
n = 3; E10.5, n = 4; E11.5, n = 4; E12.5, n = 9). d, Percentage of YFP+ cells
among AA4.1+Kit+CD45lo cells (YFP labelling efficiency) per embryonic
organ/region and developmental time points (mean ± s.e.m.). Upper
panel, Csf1riCreRosa26YFP embryos; lower panel, Csf1rMeriCreMerRosa26YFP
embryos pulsed at E8.5.
Extended Data Figure 3 | Csf1r+ progenitors have erythro-myeloid
potential ex vivo. a, Sorting strategy for CFU-C (colony forming unit-culture)
assays for E9 Csf1riCreRosa26YFP yolk sac (upper panel) and E12.5 fetal liver
from Csf1rMeriCreMerRosa26YFP embryos pulsed with OH-TAM at E8.5 (lower
panel). Dead cells were excluded based on Hoechst 33258 incorporation and,
after doublet exclusion, cells were gated based on CD45 and Kit expression.
AA4.1+Kit+CD45lo and YFP+AA4.1+Kit+CD45lo cells were isolated from
Cre− and Cre+ embryos respectively. b, Mean CFU-C frequency from three
independent experiments each of E9 Csf1riCreRosa26YFP YS and E12.5 fetal liver
from Csf1rMeriCreMerRosa26YFP embryos pulsed with OH-TAM at E8.5. CFU-
erythroid and/or megakaryocyte (E/Mk); CFU-granulocyte and/or monocyte/
macrophage (G/M); CFU-mix, at least three of the following: G, E, M and
Mk. c, Morphological validation of colony types obtained from E9 yolk sac
Csf1riCre YFP+AA4.1+Kit+CD45lo CFU-C assays. Representative images from
May-Grünwald-Giemsa stained cytospin preparations of mixed, E/Mk and
G/M colonies. Black arrowhead, macrophages; granulocyte pathway, blue
arrows; erythroid and megakaryocyte pathway, red arrows. Scale bar, 10 µm.
Extended Data Figure 4 | Analysis of Csf1r reporter expression in fetal
macrophages and red blood cells in Csf1riCreRosa26YFP embryos. a, F4/80
and CD11b expression on YFP+CD45+ cells from yolk sac, head (brain for E11.5),
limbs and liver of Csf1riCreRosa26YFP embryos (E8.5, n = 4; E9.5, n = 16;
E10.5, n = 5; E11.5, n = 9; E12.5, n = 5). Dashed line represents FMO
(fluorescence minus one) control. b, Percentage of macrophages (F4/80bright)
among YFP+ cells, mean ± s.e.m., in Csf1riCreRosa26YFP embryos (left) and
in Csf1rMeriCreMerRosa26YFP embryos pulsed with OH-TAM at E8.5 (right).
See also Supplementary Table 1. c, Percentage of YFP+ cells among F4/80bright
cells (YFP labelling efficiency) per embryonic organ/region and developmental
time points (mean ± s.e.m.). Left panel, Csf1riCreRosa26YFP embryos (E8.25,
n = 7; E8.5, n = 5; E9.5, n = 15; E10.25, n = 9; E10.5, n = 5; E11.5, n = 9;
E12.5, n = 5); right panel, Csf1rMeriCreMerRosa26YFP embryos pulsed at E8.5
(E9.5, n = 3; E10.25, n = 3; E10.5, n = 4; E11.25, n = 4; E12.5, n = 9). d, YFP
expression in erythrocytes (CD45−Ter119+) from yolk sac and fetal liver
of Csf1riCreRosa26YFP embryos (left) and Csf1rMeriCreMerRosa26YFP embryos
pulsed with OH-TAM at E8.5 (right).


©2015 Macmillan Publishers Limited. All rights reserved 




[Extended Data Figure 5 image. Panels show YFP labelling efficiency in blood, bone marrow progenitors and fetal liver erythrocytes (fetal liver versus blood RBCs), and flow cytometry of F4/80lo myeloid cells and F4/80bright macrophages (Kupffer cells, alveolar macrophages, red pulp macrophages) in liver, lung and spleen, with CD64 staining against FMO controls.]

Extended Data Figure 5 | Analysis of Flt3 reporter expression in blood
leucocytes, stem/progenitor cells, fetal red blood cells, and adult liver, lung
and spleen in Flt3CreRosa26YFP mice. a, YFP labelling efficiency in blood
lineages at different embryonic and adult time points (E14.5, n = 9; E16.5,
n = 9; E18.5, n = 7; P8, n = 7; 4-week-old, n = 6; 12-week-old, n = 9; 40-week-
old, n = 7) in Flt3CreRosa26YFP mice is shown. Lymphocytes were gated as
CD3+/CD19+, granulocytes (CD11b+Gr1+CD115−), Gr1+ monocytes
(CD11b+Gr1+CD115+), Gr1− monocytes (CD11b+Gr1−CD115+) and red
blood cells (RBCs, CD45−Ter119+). b, YFP labelling efficiency in bone marrow
LT-HSCs, ST-HSCs, MPPs and Lin−Sca1+Kit+ progenitors in 4-week-old
(n = 3) and 12-week-old (n = 6) Flt3CreRosa26YFP mice. c, YFP labelling
efficiency in fetal liver red blood cell progenitors (CD45+Ter119+) and red
blood cells (CD45−Ter119+) in Flt3CreRosa26YFP mice (E14.5, n = 5; E16.5,
n = 5; E18.5, n = 7), and comparison of YFP labelling efficiency in fetal liver
and blood red blood cells in Flt3CreRosa26YFP mice at E14.5 (n = 5), E16.5
(n = 5) and E18.5 (n = 7). d, Expression of Gr1 and MHC II, Ly-6G and Siglec-
F, CD11c and CD64, and Nkp46 and CD19 among F4/80loCD11bhi myeloid
cells in the liver. Histograms represent Flt3Cre YFP labelling efficiency in the
following defined populations: granulocytes (Gr1+MHC II− or Ly-6G+),
eosinophils (Siglec-F+), dendritic cells (CD11c+), B cells (CD19+) and NK
cells (Nkp46+) (n = 3). e, Analysis of F4/80loCD11bhi myeloid cells in the lung
as in b. f, Analysis of F4/80loCD11bhi myeloid cells in the spleen as in
b. g, Expression of CD64 in F4/80bright macrophages and in F4/80lo myeloid
cells in the liver, lung and spleen (FMO, fluorescence minus one).


©2015 Macmillan Publishers Limited. All rights reserved 


[Extended Data Figure 6, panels a-c: flow cytometry of fetal liver (E14.5
Csf1rMeriCreMer Rosa26YFP, pulsed at E8.5) and skin; gates CD45+ YFP+,
YFP+ F4/80bright and YFP+ CD11bhi; axes F4/80, Kit, CD11b and Gr1.]
Extended Data Figure 6 | Characterization of fetal F4/80lo CD11bhi myeloid
cells in liver, lung and skin. a, F4/80, Kit, CD11b and Gr1 expression on
YFP+CD45+ cells in the fetal liver at E14.5 in Csf1rMeriCreMer Rosa26YFP
embryos pulsed at E8.5 (left panel). Representative images of
May-Grünwald-Giemsa stained cytospin preparations of fetal liver
YFP+ F4/80bright and YFP+ CD11bhi cells sorted from E14.5 Csf1rMeriCreMer
Rosa26YFP embryos pulsed with OH-TAM at E8.5 (right panel). Scale bar, 10 μm.
b, c, F4/80, [...] post-natal lung (b) and skin (c) in Csf1rMeriCreMer
Rosa26YFP embryos pulsed with OH-TAM at E8.5 (green) and Flt3Cre Rosa26YFP
embryos (orange). Representative images of May-Grünwald-Giemsa stained
cytospin preparations of lung YFP+ F4/80bright and YFP+ CD11bhi F4/80lo (b)
and skin YFP+ F4/80bright and YFP+ Kit+ F4/80lo CD11b− mast cells (c) sorted
from E18.5 Flt3Cre Rosa26YFP embryos and E16.5 Csf1rMeriCreMer Rosa26YFP
embryos pulsed with OH-TAM at E8.5. Scale bar, 10 μm.

Extended Data Figure 7 | Adult BM transplantation reconstitutes the
haematopoietic system but does not replace tissue-resident F4/80bright
macrophages. a, Schematic representation of transplantation experiments.
LT-HSCs isolated from bone marrow of panRosa26YFP donor mice were injected
into Rag2−/−γc−/−KitW/Wv recipients (approximately 1,000 cells per recipient).
Eight weeks after transplantation, stem cells, myeloid progenitors, monocytes
and macrophages of recipient mice were analysed for donor chimaerism.
b, Long-term or short-term haematopoietic stem cells (LT-HSCs, ST-HSCs),
multipotent progenitors (MPPs), common myeloid progenitors (CMPs),
granulocyte-monocyte progenitors (GMPs), megakaryocyte-erythrocyte progenitors
(MEPs), and circulating Ly6Chi and Ly6Clo monocytes were isolated from
transplanted Rag2−/−γc−/−KitW/Wv mice and analysed for YFP expression.
c, F4/80bright macrophages and F4/80lo myeloid cells in spleen, liver, lung,
pancreas, epidermis and brain were analysed for YFP expression.

Extended Data Figure 8 | Analysis of fetal stem/progenitor cells and fetal
macrophages in Tie2MeriCreMer Rosa26YFP embryos pulse-labelled from E6.5 to
E10.5. a, Experimental design for fate-mapping analysis of Tie2MeriCreMer
Rosa26YFP embryos pulse-labelled at E6.5, E7.5, E8.5, E9.5 or E10.5.
b, c, Representative flow cytometry of fetal liver stem/progenitor cells (b)
and of fetal macrophages (c) in the yolk sac, head region, and embryo body at
E12.5, injected at E6.5 or at E10.5. d, Representative images of
May-Grünwald-Giemsa stained cytospin preparations of sorted YFP+ and YFP−
CD45+ F4/80bright macrophages of the embryo proper or the head region of E13.5
Tie2MeriCreMer Rosa26YFP embryos pulsed at E7.5. Scale bar, 10 μm.
e, Quantification of the percentage of YFP+ stem/progenitor cells in the fetal
liver and macrophages in yolk sac, brain (head) and embryo body at E12.5.
Embryos were labelled at E6.5 (n = 5), E7.5 (n = 7), E8.5 (n = 4), E9.5
(n = 5) or E10.5 (n = 7) and analysed at E12.5 (mean ± s.d.).

Extended Data Figure 10 | Analysis of E9.5 YS progenitor cells in
Tie2MeriCreMer Rosa26YFP embryos pulse-labelled at E7.5. a, Experimental
design for fate-mapping analysis of Tie2-expressing cells in Tie2MeriCreMer
Rosa26YFP embryos. Embryonic cells were pulse-labelled by tamoxifen (TAM)
administration into pregnant Tie2MeriCreMer mice at E7.5. Yolk sac (YS) and
embryo proper (EP) of E9.5 Tie2MeriCreMer Rosa26YFP embryos were analysed by
flow cytometry. b, Quantification of total living cells and YFP+ living cells
in yolk sac and embryo proper of analysed embryos (mean ± s.d., n = 5).
c, Flow cytometry analysis of Kit+CD45lo cells among total living cells
(black) or living YFP+ cells (blue) in yolk sac and embryo proper of a
representative E9.5 Tie2MeriCreMer Rosa26YFP embryo (left) and quantification
of all analysed embryos (right; mean ± s.d., n = 5). d, Analysis of F4/80+
fetal macrophages among CD45+ cells (black) or YFP+CD45+ cells (blue) in yolk
sac and embryo proper (mean ± s.d., n = 5) and quantification of all analysed
embryos (right; mean ± s.d., n = 5). e, Percentage of YFP cells [...]

Role of TP53 mutations in the origin and evolution of
therapy-related acute myeloid leukaemia


Terrence N. Wong, Giridharan Ramsingh, Andrew L. Young, Christopher A. Miller,
Waseem Touma, John S. Welch, Tamara L. Lamprecht, Dong Shen, Jasreet Hundal,
Robert S. Fulton, Sharon Heath, Jack D. Baty, Jeffery M. Klco, Li Ding,
Elaine R. Mardis, Peter Westervelt, John F. DiPersio, Matthew J. Walter,
Timothy A. Graubert, Timothy J. Ley, Todd E. Druley, Daniel C. Link &
Richard K. Wilson


Therapy-related acute myeloid leukaemia (t-AML) and therapy-related
myelodysplastic syndrome (t-MDS) are well-recognized complications of
cytotoxic chemotherapy and/or radiotherapy. There are several features that
distinguish t-AML from de novo AML, including a higher incidence of TP53
mutations, abnormalities of chromosomes 5 or 7, complex cytogenetics and a
reduced response to chemotherapy. However, it is not clear how prior exposure
to cytotoxic therapy influences leukaemogenesis. In particular, the mechanism
by which TP53 mutations are selectively enriched in t-AML/t-MDS is unknown.
Here, by sequencing the genomes of 22 patients with t-AML, we show that the
total number of somatic single-nucleotide variants and the percentage of
chemotherapy-related transversions are similar in t-AML and de novo AML,
indicating that previous chemotherapy does not induce genome-wide DNA damage.
We identified four cases of t-AML/t-MDS in which the exact TP53 mutation found
at diagnosis was also present at low frequencies (0.003-0.7%) in mobilized
blood leukocytes or bone marrow 3-6 years before the development of
t-AML/t-MDS, including two cases in which the relevant TP53 mutation was
detected before any chemotherapy. Moreover, functional TP53 mutations were
identified in small populations of peripheral blood cells of healthy
chemotherapy-naive elderly individuals. Finally, in mouse bone marrow
chimaeras containing both wild-type and Tp53+/− haematopoietic
stem/progenitor cells (HSPCs), the Tp53+/− HSPCs preferentially expanded
after exposure to chemotherapy. These data suggest that cytotoxic therapy does
not directly induce TP53 mutations. Rather, they support a model in which rare
HSPCs carrying age-related TP53 mutations are resistant to chemotherapy and
expand preferentially after treatment. The early acquisition of TP53 mutations
in the founding HSPC clone probably contributes to the frequent cytogenetic
abnormalities and poor responses to chemotherapy that are typical of patients
with t-AML/t-MDS.

t-AML and t-MDS are clonal haematopoietic disorders that typically develop
1-5 years after exposure to chemotherapy or radiotherapy. To understand better
how prior cytotoxic therapy contributes to the high incidence of TP53
mutations and karyotypic abnormalities in t-AML/t-MDS, we sequenced the
genomes of 22 cases of t-AML, including one case that has been previously
reported. These data were compared to whole-genome sequence data previously
reported for de novo AML and secondary AML (s-AML) arising from MDS for which
patients did not receive chemotherapy except hydroxyurea. Of the sequenced
t-AML cases, 23% had rearrangements of MLL (also known as KMT2A), 23% had
complex cytogenetics, and 36% had normal cytogenetics (Extended Data Table 1
and Supplementary Table 1).

Figure 1 | The mutational burden in t-AML is similar to de novo AML.
a, Total number of validated tier 1-3 somatic SNVs in t-AML (n = 22), de novo
AML (n = 49) and s-AML (n = 8). The mean ages of the t-AML, de novo AML and
s-AML cohorts were 55.7, 51 and 54.6 years, respectively. b, Number of
validated tier 1 somatic SNVs. c, Number of validated tier 1 small indels.
d, Percentage of tier 1-3 somatic SNVs that are transversions. e, Mutational
spectrum for all validated tier 1-3 somatic SNVs. f, Number of distinct clones
per sample inferred from the identification of discrete clusters of mutations
with distinct variant allele frequencies. g, Percentage of cases of t-AML
(n = 52) or de novo AML (n = 199) harbouring non-synonymous mutations of the
indicated gene. h, Percentage of cases of t-MDS (n = 59) or de novo MDS
(n = 150) harbouring non-synonymous mutations of the indicated gene. ABC Fm,
ABC family genes; NA, not available. +P < 0.05 by one-way analysis of variance
(ANOVA). *P < 0.05 by Fisher’s exact test. Data represent the mean ± standard
deviation (s.d.).

We predicted that DNA damage induced during exposure to cytotoxic therapy
would manifest itself in t-AML genomes with an increased mutation burden.
However, the total number of validated somatic single-nucleotide variants
(SNVs) and genic (tier 1) somatic SNVs identified was similar to that for
de novo AML and s-AML (Fig. 1a, b). Likewise, the number of small insertions
or deletions (indels) in genic regions was similar in t-AML, de novo AML and
s-AML (Fig. 1c). A previous study showed that transversions are specifically
enriched in relapsed AML after chemotherapy. However, the percentage of
transversions, and in fact of all six classes of SNVs, was similar in all
three cohorts (Fig. 1d, e). Structural variants and somatic copy number
alterations were uncommon in these t-AML cases (Supplementary Table 2 and
Extended Data Fig. 1a). Moreover, the number of identifiable subclones in
t-AML was similar to that observed in de novo AML (Fig. 1f and Extended Data
Fig. 1b). Collectively, these data show that the mutation burden of t-AML
genomes is similar to that of de novo AML genomes.
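
The comparison of transversions and of the six SNV classes can be illustrated
with a short sketch. It is not the authors' pipeline: the function names are
hypothetical, and it only shows the standard convention of collapsing each
substitution onto its pyrimidine reference base (C>A, C>G, C>T, T>A, T>C, T>G)
and of calling a purine-pyrimidine exchange a transversion.

```python
# Illustrative sketch only; function names are hypothetical, not from the paper.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
PURINES = {"A", "G"}


def snv_class(ref: str, alt: str) -> str:
    """Collapse an SNV onto its pyrimidine reference base (six classes)."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt or ref not in COMPLEMENT or alt not in COMPLEMENT:
        raise ValueError(f"not a valid SNV: {ref}>{alt}")
    if ref in PURINES:
        # Report the change on the opposite strand, where the reference is C or T.
        ref, alt = COMPLEMENT[ref], COMPLEMENT[alt]
    return f"{ref}>{alt}"


def is_transversion(ref: str, alt: str) -> bool:
    """True when a purine is exchanged for a pyrimidine, or vice versa."""
    return (ref.upper() in PURINES) != (alt.upper() in PURINES)


def transversion_percentage(snvs) -> float:
    """Percentage of (ref, alt) pairs that are transversions."""
    snvs = list(snvs)
    if not snvs:
        raise ValueError("no SNVs supplied")
    return 100.0 * sum(is_transversion(r, a) for r, a in snvs) / len(snvs)
```

Applied per genome, a function like `transversion_percentage` yields the kind
of per-sample value summarized in Fig. 1d, and counting `snv_class` labels
gives a six-class spectrum of the kind shown in Fig. 1e.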

We next asked whether the pattern of genes frequently mutated in t-AML/t-MDS
is distinct from that observed in de novo AML/MDS. Whole-genome sequencing
identified an average of 10.2 ± 7.1 missense, nonsense, in-frame indel or
frameshift mutations per t-AML genome (Supplementary Table 3). To define
better the frequency of specific mutations in t-AML/t-MDS, we sequenced a
panel of 149 AML/MDS-related genes in an additional 89 patients with t-AML or
t-MDS (Supplementary Table 4). We combined the whole-genome sequence data with
the extension series to report on 52 cases of t-AML and 59 cases of t-MDS.
Abnormalities of chromosome 5 or 7 or complex cytogenetics were present in
55.0% of cases (Extended Data Table 2 and Supplementary Table 1). The
t-AML/t-MDS data were compared to 199 previously reported de novo AML genomes
or exomes, or 150 previously reported cases of de novo MDS in which extensive
candidate gene sequencing was performed. As reported previously, TP53
mutations are significantly enriched in t-AML/t-MDS compared with de novo
AML/MDS (Fig. 1g, h and Supplementary Table 5). Interestingly, mutations of
ABC transporter genes, a subset of which have been implicated in chemotherapy
resistance, are also enriched in t-AML versus de novo AML. On the other hand,
several well-defined driver gene mutations (that is, DNMT3A and NPM1) were
significantly less common in t-AML. Thus, although the total mutation burden
is similar, a distinct subset of mutated genes is present in t-AML/t-MDS.

TP53 is the most commonly mutated gene in t-AML/t-MDS, with 33.3% of patients
affected in our series (Fig. 1g, h); the vast majority of these mutations have
previously been identified as pathogenic. Multivariate analysis revealed that
TP53 mutations were associated with poor-risk cytogenetics and a worse
prognosis (Supplementary Tables 6, 7 and Extended Data Fig. 2), both hallmarks
of t-AML/t-MDS. These observations suggest a central role for TP53 mutations
in the pathogenesis of many cases of t-AML/t-MDS. However, the mechanism by
which TP53 mutations are selectively enriched in t-AML/t-MDS is unclear. The
mutation burden in the genomic region containing TP53 (including silent
tier 1, and any tier 2 or tier 3 mutations) is similar between t-AML and
de novo AML (Extended Data Fig. 1c). Thus, it is not likely that chemotherapy
directly induces TP53 mutations. We recently reported that individual HSPCs
accumulate somatic mutations as a function of age, such that by age 50, there
are on average five coding gene mutations per HSPC. On the basis of these data
and on current estimates that there are approximately 10,000 haematopoietic
stem cells (HSCs) in humans, we predict that 44% of healthy individuals at
50 years of age may have at least one HSPC that carries a randomly generated,
functional TP53 mutation (see Methods). TP53 has a central role in regulating
cellular responses to genotoxic stress, and loss of TP53 provides a selective
advantage for neoplastic growth. Together, these observations suggest a model
in which rare HSPCs carrying age-related TP53 mutations are resistant to
chemotherapy and expand preferentially after treatment (Extended Data Fig. 3).
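
The arithmetic behind a prediction of this kind can be sketched as follows. The
derivation actually used is in the Methods and is not reproduced here; the
sketch instead assumes that each of the roughly 10,000 HSCs independently
carries a functional TP53 mutation with the same small probability p, so that
P(at least one) = 1 - (1 - p)^N. The function names are hypothetical.

```python
import math


def prob_at_least_one(per_cell_prob: float, n_cells: int) -> float:
    """P(at least one mutant cell), assuming independent, identical cells."""
    # 1 - (1 - p)^N, written with log1p/expm1 to stay accurate for tiny p.
    return -math.expm1(n_cells * math.log1p(-per_cell_prob))


def implied_per_cell_prob(prob_any: float, n_cells: int) -> float:
    """Invert the relation: the per-cell p that yields a given P(at least one)."""
    return -math.expm1(math.log1p(-prob_any) / n_cells)


# Values quoted in the text: about 10,000 HSCs and a 44% prediction.
N_HSC = 10_000
p = implied_per_cell_prob(0.44, N_HSC)
print(f"implied per-HSC probability: {p:.2e}")
```

Under these assumptions the 44% figure corresponds to a per-cell probability
of roughly 6 in 100,000, which shows how a very rare per-cell event becomes
common at the level of the whole stem cell pool.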

This model suggests the following testable predictions: (1) in patients 
with t-AML containing clonal TP53 mutations, HSPCs harbouring the 
specific TP53 mutation will be present long before the development of 
overt t-AML; (2) somatic TP53 mutations will be present in the HSPCs 
of some healthy individuals never exposed to cytotoxic therapy; and 
(3) HSPCs harbouring TP53 mutations will expand under the selective 
pressure of chemotherapy. 

To test the first prediction, we identified seven cases of t-AML/t-MDS with
specific TP53 mutations for which we had leukapheresis or bone marrow
specimens banked 3-8 years before the development of t-AML/t-MDS (Extended
Data Table 3). Of note, in all the cases, the TP53 mutation was clonal in the
t-AML/t-MDS diagnostic sample. Current next-generation sequencing technology
is limited in the detection of rare variant alleles owing to an intrinsic
sequencing error rate of ~0.1% (ref. 19). To overcome this limitation, we
introduced random barcodes during production of the sequencing libraries, such
that sequence ‘read families’ containing unique barcodes are generated
(Extended Data Fig. 4a). Using tumour DNA with a known TP53 mutation, we show
that this assay can detect a variant allele with a frequency of 0.009%
(Extended Data Fig. 4b, c).
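
The read-family idea can be sketched in a few lines. This is a simplified
illustration, not the assay's software: it assumes reads arrive as (barcode,
sequence) pairs that are already aligned to the same coordinates and of equal
length, and the thresholds `min_family_size` and `min_agreement` are
hypothetical defaults chosen only for the example.

```python
from collections import Counter, defaultdict


def build_read_families(reads, min_family_size=3):
    """Group (barcode, sequence) pairs by barcode; drop undersized families."""
    families = defaultdict(list)
    for barcode, seq in reads:
        families[barcode].append(seq)
    return {b: s for b, s in families.items() if len(s) >= min_family_size}


def consensus(seqs, min_agreement=0.9):
    """Per-position majority vote; positions lacking agreement become 'N'."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("reads in a family must have equal length")
    out = []
    for column in zip(*seqs):
        base, count = Counter(column).most_common(1)[0]
        out.append(base if count / len(column) >= min_agreement else "N")
    return "".join(out)


def variant_allele_fraction(consensus_seqs, position, alt_base):
    """Fraction of informative consensus sequences carrying alt_base."""
    informative = [s for s in consensus_seqs if s[position] != "N"]
    if not informative:
        return 0.0
    return sum(s[position] == alt_base for s in informative) / len(informative)
```

Because a random sequencing error rarely recurs at the same position in most
reads that share a barcode, it is voted out of the consensus, whereas a true
variant present in the original molecule appears in every read of its family.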

The specific TP53 mutation present in the diagnostic t-AML/t-MDS sample was
identified in previously banked specimens in four out of the seven cases
tested (see Supplementary Notes for case presentations). In


Figure 2 | Biallelic TP53 mutations are early mutational events in the AML
cells of UPN 530447. a, Clinical course of case 530447. Chemo, chemotherapy;
XRT, radiation therapy. b, Unique adaptor sequencing of a leukapheresis sample
obtained 6 years before the diagnosis of t-AML for each of the five clonal
somatic SNVs identified in the diagnostic t-AML sample. Genomic DNA from a
patient lacking these variants served as a control. Blue circles indicate the
position of the variant SNV. c, Proposed model of clonal evolution to t-AML in
this case.


26 FEBRUARY 2015 | VOL 518 | NATURE | 553 




the other three cases, we were unable to detect the diagnostic TP53 mutation
in the previously banked blood or bone marrow sample; it is not clear whether
these mutations were present but below our limit of detection or were truly
absent. Patient 530447 developed t-AML after an autologous stem cell
transplant for refractory Hodgkin’s lymphoma (Fig. 2a). The diagnostic t-AML
sample carried biallelic mutations of TP53, missense mutations of TET2 and
NUP98, a silent mutation of CSMD1, and a subclonal KRAS mutation. Analysis of
a leukapheresis sample obtained 6 years before the development of t-AML
revealed that both TP53 mutant alleles were present with a variant allele
fraction (VAF) of approximately 0.5% (Fig. 2b). The CSMD1 mutation was also
present at the same VAF and is probably a passenger mutation. However, two
potential driver mutations (TET2 and NUP98) were not detectable in the
previously banked sample. Thus, these data show that, in this patient, the
biallelic TP53 mutations preceded the development of t-AML by at least 6 years
and antedated the development of the TET2 and NUP98 mutations (Fig. 2c). In a
second case (unique patient number (UPN) 341666), a heterozygous TP53 R196*
mutation was identified in mobilized peripheral blood leukocytes 3 years
before the development of t-MDS at a frequency of 0.1%, preceding the
acquisition of a RUNX1 mutation (Extended Data Fig. 5).

In two of the four cases, the previously banked sample was obtained before the
initiation of chemotherapy. Patient 967645 developed t-AML 5 years after the
diagnosis of marginal zone lymphoma (Fig. 3a). The diagnostic t-AML sample
contained a homozygous TP53 Y220C mutation. Using a droplet digital polymerase
chain reaction (ddPCR) assay, we identified the same TP53 Y220C mutation in a
bone marrow sample obtained before any chemotherapy at a frequency of 0.0027%
(average of two independent experiments) (Fig. 3b). We next asked whether
other mutations in the diagnostic t-AML sample were also present in this
previously banked sample (Supplementary Table 8). We focused on the G155S
mutation in SNAP25; this mutation is probably non-pathogenic as SNAP25 is not
expressed in AML samples. Indeed, we identified the SNAP25 G155S mutation in
the previously banked bone marrow sample with a similar VAF (0.0029%) as that
for TP53 Y220C (Fig. 3c). Of note, deletion (del)(5q) and del(7q) were
subclonal at diagnosis (present in 54% and 38% of metaphases, respectively)
(Supplementary Table 1). Collectively, these data provide evidence that an
HSPC harbouring a TP53 Y220C mutation preferentially expanded after
chemotherapy with the subsequent acquisition of del(5q) and then del(7q)
(Fig. 3d). Of note, we found two other cases of t-AML/t-MDS with clonal TP53
mutations but subclonal del(5q), del(5) and/or del(7) (UPNs 756582 and 837334,
Supplementary Table 1). Together, these data suggest that TP53 mutations
precede the development of these characteristic cytogenetic abnormalities of
t-AML/t-MDS.

In a second case, patient 895681 developed t-MDS 5.5 years after the
initiation of chemotherapy for non-Hodgkin’s lymphoma (Fig. 3e). The
diagnostic t-MDS sample contained a clonal TP53 H179L mutation. Using ddPCR,
we identified TP53 H179L at a VAF of 0.05% in a bone marrow sample taken
before the initiation of cytotoxic therapy (Fig. 3f). Thus, as with patient
967645, an HSPC carrying a functional TP53 mutation was present before
cytotoxic therapy exposure, later giving rise to the malignant t-AML/t-MDS
clone (Fig. 3g).
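
How a VAF might be derived from droplet counts can be sketched with the
standard Poisson model of droplet loading. This section does not state how the
ddPCR frequencies were computed, so the model is an assumption, as are the
function name and the droplet counts in the usage note. It takes three gates:
droplets with only the mutant allele, droplets with wild-type (with or without
mutant), and empty droplets, and it assumes mutant and wild-type templates
load into droplets independently.

```python
import math


def ddpcr_vaf(mutant_only: int, wild_type: int, empty: int) -> float:
    """Estimate VAF from three droplet gates under independent Poisson loading.

    mutant_only: droplets positive for the mutant allele only
    wild_type:   droplets positive for wild-type (with or without mutant)
    empty:       droplets with no template
    """
    if min(mutant_only, wild_type) < 0 or empty <= 0:
        raise ValueError("counts must be non-negative, with some empty droplets")
    total = mutant_only + wild_type + empty
    no_wt = mutant_only + empty  # droplets that contain no wild-type template
    lam_wt = -math.log(no_wt / total)   # mean wild-type copies per droplet
    lam_mut = -math.log(empty / no_wt)  # mean mutant copies per droplet
    if lam_mut + lam_wt == 0:
        return 0.0
    return lam_mut / (lam_mut + lam_wt)
```

With hypothetical counts of 10 mutant-only, 10,000 wild-type and 9,990 empty
droplets, `ddpcr_vaf(10, 10000, 9990)` gives a VAF of about 0.14%. Frequencies
as low as those reported here would require far more wild-type templates per
mutant droplet than in this toy example.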

To determine whether HSPCs harbouring TP53 mutations are present in healthy
individuals, we analysed peripheral blood leukocytes from 20 elderly
(68-89 years old) cancer-free donors who had not received prior cytotoxic
therapy. We limited our sequencing to exons 4-8 of TP53 since the majority of
pathogenic mutations in TP53 are located in these exons. Using our unique
adaptor sequencing assay, we identified TP53 mutations in 9 of 19 evaluable
cases, with VAFs ranging from 0.01% to 0.37% (Extended Data Table 4). Of note,
since we did not sequence the entire coding region of TP53, it is likely that
our study underestimates the true frequency of healthy elderly individuals
harbouring HSPCs with TP53 mutations. ddPCR confirmed the presence of the TP53
mutation in all three cases that were tested (Extended Data Fig. 6).
Interestingly, the majority of the TP53 mutations identified are known
pathogenic mutations previously implicated in cancer. These data suggest that
functional TP53 mutations may confer (even in the absence of cytotoxic
therapy) a subtle competitive advantage that results in modest HSPC expansion
over time.

Figure 3 | HSPC clones harbouring somatic TP53 mutations are detected in
patients before cytotoxic therapy exposure. a, Clinical course of case 967645.
b, Dot plots of ddPCR of a diagnostic t-AML sample from case 967645, a bone
marrow sample from this patient obtained 5 years before the development of
t-AML (before any cytotoxic therapy), or a control sample from a patient
lacking a mutation in TP53. Droplets containing only the TP53 Y220C allele are
highlighted in orange, droplets containing wild-type TP53 (with or without
TP53 Y220C) are highlighted in blue; empty droplets are grey. The number of
droplets in each gate is indicated. Data are representative of two independent
experiments. c, Dot plots of ddPCR data for SNAP25 G155S using the same
genomic DNA as in b. d, Proposed model of clonal evolution to t-AML in case
967645. e, Clinical course of case 895681. Chemo, chemotherapy; DLBCL, diffuse
large B-cell lymphoma; FFPE, formalin-fixed paraffin-embedded; XRT,
radiotherapy. f, Dot plots of ddPCR data of the diagnostic t-MDS sample from
case 895681, a bone marrow FFPE sample from this patient obtained 5.5 years
before the development of t-MDS (before any cytotoxic therapy), or a control
FFPE sample obtained from a patient lacking a mutation in TP53. The labelling
scheme is the same as in b. g, Proposed model of clonal evolution to t-MDS in
case 895681; the diagnostic t-MDS sample contained a subclonal ETV6 mutation.

To test directly the hypothesis that functional TP53 mutations confer a
competitive advantage after chemotherapy, we generated mixed bone marrow
chimaeras containing both wild-type and Tp53+/− cells (Fig. 4a). In mice
treated with vehicle control, we observed a non-significant trend towards an
increased Tp53+/− donor contribution to haematopoiesis (Fig. 4b-e). Whether
longer follow-up would confirm a subtle competitive advantage, as suggested by
the expansion of TP53 mutant HSPC clones in elderly healthy individuals, will
require additional study. Regardless, upon treatment with
N-ethyl-N-nitrosourea (ENU), Tp53+/− HSPCs show a competitive advantage.
Importantly, a previous study similarly showed that Tp53+/− HSCs also have a
competitive advantage after irradiation, which appeared to be due, at least in
part, to reduced irradiation-induced senescence in Tp53+/− HSCs.

Figure 4 | Heterozygous loss of TP53 confers a clonal advantage to HSCs after
exposure to ENU. a, Experimental schema. Bone marrow chimaeras were generated
by transplanting a 7 to 1 ratio of wild-type to Tp53+/− bone marrow into
irradiated syngeneic recipients. After haematopoietic reconstitution
(5 weeks), mice were treated with ENU or vehicle control as indicated.
b-d, Shown is the percentage of total leukocytes (b), Gr-1+ neutrophils (c) or
B220+ B cells (d) that were derived from Tp53+/− cells. e, Percentage of
Kit+ lineage− Sca+ (KSL) cells in the bone marrow 12 weeks after ENU exposure
that were derived from Tp53+/− cells. Data represent the mean ± standard
error of the mean (s.e.m.) of 11 and 14 mice in the ENU and vehicle cohorts,
respectively. Peripheral chimaerism was analysed using two-way ANOVA and KSL
chimaerism was analysed using an analysis of covariance (ANCOVA).

There is increasing evidence that cancers undergo clonal evolution under the
selective pressure of chemotherapy. For example, the clonal architecture of
de novo AML is dynamic, with certain (often minor) subclones becoming dominant
at relapse after chemotherapy. We show that HSPCs that acquire heterozygous
TP53 mutations as a function of normal ageing are also subject to Darwinian
selection upon exposure to cytotoxic therapy, ultimately resulting in the
expansion of HSPCs with these mutations. The high frequency (nearly 50%) of
elderly individuals with detectable heterozygous TP53 mutations in their
circulating leukocytes far exceeds the prevalence of AML or MDS in this age
group. Clearly, additional mutations, including mutation of the second TP53
allele, are needed for transformation to AML or MDS. Consistent with this
observation, only a minority of patients with Li-Fraumeni syndrome, most of
whom harbour germline heterozygous TP53 mutations, develop AML or MDS. This
model provides a potential mechanism for the high incidence of TP53 mutations
in t-AML/t-MDS. The TP53 mutation in the founding clone probably contributes
to the frequent cytogenetic abnormalities and poor response to chemotherapy
that are typical of t-AML/t-MDS. For t-AML/t-MDS cases that do not harbour
TP53 mutations, it will be important to determine whether different
age-related mutations also confer a competitive advantage to HSPCs that are
exposed to cytotoxic therapy, and to define the nature of these mutations.

Online Content Methods, along with any additional Extended Data display items
and Source Data, are available in the online version of the paper; references
unique to these sections appear only in the online paper.

METHODS


Patient characteristics. For the whole-genome sequencing study, we
intentionally selected the original 22 cases of t-AML to have minimal numbers
of cytogenetic abnormalities. However, the additional 89 cases of t-AML/t-MDS
were randomly selected from those samples with sufficient tumour and skin DNA.
All patients were selected from a larger cohort of adult AML and MDS patients
enrolled in a single institution tissue banking protocol that was approved by
the Washington University Human Studies Committee (WU HSC#01-1014). Written
informed consent for whole-genome sequencing was obtained from all study
participants. Patients were treated in accordance with National Comprehensive
Cancer Network (NCCN) guidelines (http://www.nccn.org) with an emphasis on
enrolment in therapeutic clinical trials whenever possible. Clinical data for
all patients, including the pre-existing condition requiring cytotoxic
therapy, the cytotoxic therapy received before the t-AML/t-MDS diagnosis,
cytogenetics, treatment approach and outcomes data, are presented in Extended
Data Tables 1 and 2 and Supplementary Table 1. Peripheral blood leukocyte
genomic DNA from cancer-free individuals (median age = 75.3 ± 6.6 years) was
obtained as part of a Washington University Institutional Review
Board-approved protocol. All subjects had no previous history of invasive
cancer or treatment with cytotoxic therapy, as determined by the medical
history.
Whole-genome sequencing and variant detection. A previously described
procedure was followed for library construction and whole-genome sequencing.
Briefly, Illumina DNA sequencing was used to generate sequence that covered
the haploid reference at a depth between 30.51 and 72.60 (Supplementary
Table 9). Sequence data were aligned to reference sequence build NCBI human
build 36 using BWA v.0.5.5 (ref. 26) (params: -t 4) then merged and
deduplicated using Picard v.1.29. We detected SNVs using the intersection of
SAMtools v.r963 (ref. 27) (params: -A -B) and Somatic Sniper v.0.7.3 (ref. 28)
(params: -q 1 -Q 15), and filtered to remove false positives (params:
min-base-quality 15, min-mapping-quality 40, min-somatic-score 40). Indels
were detected using GATK version 5336 (ref. 29) unioned with Pindel v.0.5
(ref. 30). Somatic copy number alterations were detected using copyCat v.1.5
(http://github.com/chrisamiller/copycat). We detected structural variants
using BreakDancer v.1.2 (ref. 31) and SquareDancer v.0.1
(https://github.com/genome/genome/blob/master/lib/perl/Genome/Model/Tools/Sv/SquareDancer.pl),
followed by assembly with Tigra-SV (https://github.com/genome/tigra-sv).
SciClone (in review; http://github.com/genome/sciclone) was used to infer the
subclonal architecture of all whole-genome sequencing samples.

Validation and extension sequencing with variant detection. We used custom
sequence capture arrays from Roche Nimblegen that targeted variants detected
by whole-genome sequencing and extended this array to cover all coding exons
from an additional 149 genes of interest (Supplementary Table 4). Libraries
were prepared, sequence was generated, and somatic alterations identified as
described for whole-genome sequencing with the addition of VarScan v.2.2.6
(ref. 32) (params: --min-var-freq 0.08 --p-value 0.10 --somatic-p-value 0.01
--validation) as a variant caller for both SNVs and indels. On average, genes
were covered with a depth of 58.3 (Supplementary Table 10). Biallelic TP53
mutations in case 530447 were confirmed with PCR amplification of the genomic
region containing both somatic mutations from the diagnostic t-AML sample. The
resulting amplicons were cloned into the pCR-TOPO plasmid vector (Life
Sciences) and sequenced using Sanger sequencing.
Statistical analyses. Fisher’s exact tests were used to evaluate the association between 
pairs of dichotomous variables, with a significant right-sided P value indicating a 
positive relationship and a significant left-sided P value indicating a negative rela- 
tionship. The relationship between overall survival and each discrete measure was 
tested with Kaplan-Meier survival analyses with separate analyses for the AML 
and MDS groups. Age at diagnosis was discretized into quartiles for each group. 
Multivariate proportional-hazards regression models were created separately for 
the AML and MDS groups. All variables with log-rank P values of 0.20 or less in 
the Kaplan-Meier analyses were included in the first step. In successive steps, the 
variable with the largest P value was removed and the model re-run until all remain- 
ing variables had P values of 0.05 or less. Two-way interactions among the remain- 
ing variables were examined. Variables removed in earlier steps were added back to 
the model one at a time to determine if they significantly improved the final model. 
The proportionality assumption was evaluated for each variable in the final models. 
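
As an illustration of the survival estimates that feed the log-rank screening described above, the following is a minimal sketch of the Kaplan-Meier product-limit estimator. It is not the software used in this study; the function name `kaplan_meier` and its inputs are hypothetical, and it omits the log-rank test and the proportional-hazards modelling.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(time, survival)] at each distinct time with an observed death.

    times  -- follow-up time for each patient (for example, days)
    events -- 1 if death was observed at that time, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e)
    leaving = Counter(times)      # every patient leaves the risk set at their own time
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(leaving):
        d = deaths.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction surviving this event time.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= leaving[t]
    return curve
```

Censored patients lower the number at risk for later time points without producing a step in the curve, which is why separate curves per group (for example, TP53 mutated versus wild type) can be compared even when follow-up is incomplete.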
Rare variant detection using unique adaptor next-generation sequencing. Amp- 
licons approximately 200 bp in length were prepared from patient genomic DNA 
samples using primers designed to amplify genomic regions harbouring known 
tumour-specific SNVs (Supplementary Table 11). These amplicons were prepared 
for next-generation sequencing (NGS) using the Illumina TruSeq DNA Sample 
Preparation Kit (Illumina Catalog #FC-121-2001) replacing the kit adapters with 
adapters containing a random nucleotide index sequence. Libraries were quanti- 
fied using the Agilent qPCR NGS Library Quantification Kit, Illumina GA (Agilent 
Technologies Catalog #G4880A). Using this quantification, each library was diluted 
to ensure that each random index would be observed in multiple sequenced reads (refs 33, 34). 
Each diluted library was amplified and sequenced on the Illumina MiSeq platform. 


Sequenced reads containing the same index sequence were grouped together, creating 
‘read families’ in a manner similar to established methods. Reads within a 
read family were aligned against each other to filter out stochastic sequencing errors 
generating an error-corrected read family consensus sequence. Each consensus 
sequence was locally aligned to UCSC hg19/GRCh37 using bowtie2 (ref. 35) with 
the default settings. The aligned read families were processed with Mpileup (ref. 27) using 
the parameters -BQ0 -d 10000000000000. Next, variants were called with VarScan (ref. 32) 
using the parameters --min-coverage 10000 --min-reads2 10 --min-avg-qual 0 --min-var-freq 0 
--p-value 1. Variant allele frequencies for the expected mutations and the 
background error rate were visualized using IGV (ref. 36) and graphically represented using 
ggplot2 (ref. 37). Variant coordinates are displayed in hg18/GRCh36. 
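
The read-family error correction described above can be sketched as follows. This is a simplified illustration, not the pipeline used here: it assumes that reads within a family are already in the same frame (so no within-family alignment is done), and the function names, the `min_family_size` default and the `min_fraction` agreement threshold are hypothetical.

```python
from collections import Counter, defaultdict


def group_read_families(reads):
    """Group (index_sequence, read_sequence) pairs by their random index."""
    families = defaultdict(list)
    for index, seq in reads:
        families[index].append(seq)
    return families


def consensus(family, min_fraction=0.6):
    """Per-position majority call; positions without clear agreement become 'N'."""
    length = min(len(seq) for seq in family)
    calls = []
    for i in range(length):
        base, count = Counter(seq[i] for seq in family).most_common(1)[0]
        calls.append(base if count / len(family) >= min_fraction else "N")
    return "".join(calls)


def error_corrected_reads(reads, min_family_size=3, min_fraction=0.6):
    """Return one consensus sequence per read family that is large enough."""
    return {
        index: consensus(family, min_fraction)
        for index, family in group_read_families(reads).items()
        if len(family) >= min_family_size
    }
```

A stochastic sequencing error appears in only some members of a family and is outvoted, whereas a true variant is present in every read derived from the same indexed molecule and therefore survives into the consensus.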

Detection of somatic TP53 mutations in cancer-free subjects. Amplicons were 
prepared from healthy control genomic DNA samples using primers designed to 
amplify exons 4-8 of TP53 (Supplementary Table 11). Patient-specific barcodes, 6 
nucleotides in length, were appended to the 5’ end of each primer to enable pooling 
of multiple samples for sequencing. Amplicons generated from each TP53 exon/ 
patient sample combination were generated as previously described and purified 
products were pooled in equimolar amounts. The pooled barcoded amplicons were 
prepared for error-corrected sequencing as previously described. Sequencing was 
completed on the Illumina Hi-Seq 2500 platform. Sequenced reads were demulti- 
plexed based on the known patient-specific barcode sequences using a 2-nucleotide 
Hamming distance. Demultiplexed sequence reads were organized into read 
families based on their random oligonucleotide index sequence and error-corrected as 
outlined previously. Read families composed of three reads or more were used for 
analysis. A binomial distribution of the substitution rate at each covered base in 
TP53 was used to identify individuals with somatic TP53 mutations. A variant was 
called if it met the following criteria: (1) the binomial P value was less than 10⁻⁶; (2) 
the VAF was greater than 1:10,000; (3) at least 10,000 unique read families were 
sequenced at the position of interest; (4) at least 10 read families called the variant; 
and (5) the VAF in the individual was greater than five times the mean VAF for all 
individuals with greater than 10,000 coverage at that specific nucleotide. Read 
families from one patient sample (barcode GTACGGC) were removed from ana- 
lysis due to a high error rate. All somatic mutations were identified in this manner 
except for TP53 Y220C, which received closer manual inspection due to the large 
number of these mutations observed in our t-AML cohort. 
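
The five calling criteria listed above can be expressed as a short filter. The sketch below is an assumption-laden illustration rather than the code used in this study: the text does not state exactly how the null substitution rate for the binomial test was set, so `background_rate` is a hypothetical input, as are the function names.

```python
import math


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    if k <= 0:
        return 1.0
    if p <= 0.0:
        return 0.0
    if p >= 1.0:
        return 1.0
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * log_p + (n - i) * log_q)
        term = math.exp(log_term)
        total += term
        # Past the mean the terms only shrink; stop once they no longer matter.
        if i > n * p and term < total * 1e-17:
            break
    return min(total, 1.0)


def call_variant(var_families, total_families, background_rate, mean_vaf_others):
    """Apply criteria (1)-(5) to one position in one individual."""
    vaf = var_families / total_families if total_families else 0.0
    p_value = binomial_upper_tail(var_families, total_families, background_rate)
    return (p_value < 1e-6                    # (1) binomial P value
            and vaf > 1e-4                    # (2) VAF greater than 1:10,000
            and total_families >= 10_000      # (3) unique read families at the position
            and var_families >= 10            # (4) read families calling the variant
            and vaf > 5 * mean_vaf_others)    # (5) five times the mean VAF of others
```

For example, with 91 variant read families out of 316,765 and an assumed background rate of 2 × 10⁻⁵, `call_variant(91, 316765, 2e-5, 2e-5)` passes all five criteria, whereas the same position with only 9 variant families fails criterion (4).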

Extraction of genomic DNA from FFPE samples. Genomic DNA was extracted 
from FFPE samples with the QIAamp DNA FFPE Tissue Kit. Because of the effects 
of formalin fixation (cross-linking, DNA fragmentation, and so on), the amount of 
amplifiable DNA per sample was less than would be expected with Qubit fluoro- 
metric quantitation. As such, ddPCR was used to quantify the amount of amplifiable 
genomic DNA per sample such that the numbers of amplifiable domains tested were 
comparable between experimental and control samples. 

ddPCR. All primers and probes for ddPCR were designed by Bio-Rad as per MIQE 
guidelines (ref. 38). In the case of TP53 Y220C, the TP53 region of interest in exon 6 was 
amplified with the following primers: 5’-TTTTCGACATAGTGTGGTG-3’ and 
5’-CTGACAACCACCCTTAAC-3’. The 5’-Hex/TGCCCTATGAGCCGCCT/Iowa 
Black FQ-3’ probe was used to detect the wild-type allele and the 
5’-FAM/CCCTGTGAGCCGCCTGA/Iowa Black FQ-3’ probe was used to detect the mutant allele. 
All reagents were purchased from Bio-Rad. ddPCR was performed as previously 
described (ref. 39). Specifically, quantitative PCR was performed with 900-1,800 nM forward 
and reverse primers, 250 nM mutant and wild-type genomic probes, and 2-4 ng µl⁻¹ 
genomic DNA. Quantitative PCR was performed with annealing/extension tem- 
peratures of 55.5-60 °C for 40 cycles. For droplet generation and analysis, we used 
the Bio-Rad QX100 and QX200 Droplet Digital PCR Systems. 

Because DNA degradation with time (that is, guanosine oxidation, 
cytosine deamination) is known to interfere with rare allele detection, we only 
identified variant alleles present in droplets also lacking the reference allele. This 
greatly increased the specificity of our calls by removing droplets in which one of 
the two DNA strands may have been chemically altered. At low variant allele fre- 
quency, it was assumed that only a single variant allele was present in these ‘mutant 
only’ drops. Droplet allele distribution follows a Poisson distribution such that the 
number of droplets only containing a single allele (either variant or reference) can 
be determined from the percentage of empty droplets. Of note, droplets showing 
evidence of template independent amplification (that is, observed in ‘no template 
controls’) were counted as empty droplets. The VAF was determined from the frac- 
tion of the single allele droplets containing the variant allele. When appropriate, 
control samples were used to subtract potential background signal. VAFs calcu- 
lated in this method were highly concordant with VAFs obtained through unique- 
adaptor NGS. 
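
One reading of the Poisson correction described above is sketched below. It is illustrative only: the function name and arguments are hypothetical, background subtraction with control samples is left out, and it relies on the stated low-VAF assumption that each ‘mutant only’ droplet holds a single variant allele.

```python
import math


def ddpcr_vaf(total_droplets, empty_droplets, mutant_only_droplets):
    """Estimate the VAF from droplet counts under a Poisson loading model."""
    if not 0 < empty_droplets < total_droplets:
        raise ValueError("need at least one empty and one occupied droplet")
    # Mean number of template alleles per droplet, from the empty fraction e^(-lambda).
    lam = -math.log(empty_droplets / total_droplets)
    # Expected number of droplets holding exactly one allele: N * lambda * e^(-lambda).
    single_allele = total_droplets * lam * math.exp(-lam)
    # Fraction of single-allele droplets that carry the variant allele.
    return mutant_only_droplets / single_allele
```

With 20,000 droplets of which 10,000 are empty, the mean loading is ln 2 alleles per droplet and about 6,931 droplets are expected to contain exactly one allele, so 7 mutant-only droplets correspond to a VAF of roughly 0.1%.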

Generation and analysis of Tp53+/− bone marrow chimaeras. Tp53+/− and wild-type 
mice were inbred on a C57BL/6 strain. Bone marrow from Tp53+/− mice 
expressing Ly5.2 was mixed at a 1:7 ratio with bone marrow from wild-type mice 
expressing Ly5.1 and transplanted retro-orbitally into lethally irradiated Ly5.1/5.2 
recipients. Tp53+/− and wild-type donors were both age (6-12 weeks) and sex 
matched (female). A total of 3 × 10⁶ cells were injected per recipient mouse. Recipient 
mice were conditioned with 1,000-1,100 cGy from a ¹³⁷caesium source at a rate 
of approximately 95 cGy min⁻¹ before transplantation. Prophylactic antibiotics 
(trimethoprim-sulfamethoxazole; Alpharma) were given during the initial 2 weeks 
after transplantation. Five weeks after transplantation, mice were given two doses 
of ENU (100 mg kg⁻¹; Sigma-Aldrich) or vehicle alone intraperitoneally 9 days apart. 
Mice were stratified according to Tp53+/− chimaerism and then randomly distributed 
into the ENU and vehicle controls such that both cohorts had similar 
levels of Tp53+/− chimaerism at baseline. ENU and placebo were delivered in a final 
solution with 10% DMSO, 90 mM sodium citrate, and 180 mM sodium phosphate, 
pH 5.0. Peripheral blood chimaerism was measured before ENU administration 
and 4-12 weeks after ENU administration. The investigator was not blinded. Mice 
were euthanized and bone marrow chimaerism analysed 12 weeks after ENU administration. 
The desired cohort size was determined based on observations from previously 
reported experiments, and two independent experiments were performed. 
Mice were maintained under standard pathogen-free conditions according to methods 
approved by the Washington University animal studies committee. 

Flow cytometry. Flow cytometry data were collected on a Gallios 10-colour, 3-laser 
flow cytometer (Beckman Coulter) and analysed with FlowJo software (Treestar). 
Cells were stained by standard protocols with the following antibodies (eBiosciences 
unless otherwise noted): Ly5.1 (A20, CD45.1), Ly5.2 (104, CD45.2), Ly6C/G (RB6- 
8C5, Gr-1), CD3e (145-2C11), CD45R (RA3-6B2, B220), CD11c (N418), TER-119, 
CD41 (MWReg30), CD117 (ACK2, c-Kit) and Ly-6A/E (D7, Sca). 

Estimation of TP53 mutation frequency in ageing stem cells. The frequency and 
profile of somatic single-nucleotide mutations in the HSCs of normal individuals 
have been previously measured. The somatic mutational burden is ageing-related, 
and the estimated rate of mutagenesis obtained from this study was 3.2 × 10⁻⁹ 
mutations per nucleotide per year (95% confidence interval 2.4-4.0 × 10⁻⁹) for the 
average nucleotide in the exome. Thus, we would predict an average 50 year old to 
have 1.6 × 10⁻⁷ mutations per position (3.2 × 10⁻⁹ per year × 50 years). These mutations would not be randomly 
distributed but biased (in particular towards C to T/G to A transitions). It has been 
previously proposed that an individual possesses approximately 10,000 distinct 
HSCs. We used a randomized Monte Carlo simulation to model the prevalence 
of somatic single-nucleotide mutations in healthy 50 year olds with 10,000 HSCs 
given a normal somatic mutational profile and mutation rate. Repeated simula- 
tion (n = 100,000) allowed us to predict the distribution of ageing-induced TP53 
(NM_000546) somatic mutations. As expected, this simulation modelled a Poisson 
process. We classified TP53 mutations as likely to be functional if they fulfilled both 
of the following criteria. First, we analysed the mutations using the SIFT program 
(http://sift.jcvi.org) and required a SIFT score ≤ 0.05. Second, we required that the 
somatic mutations be reported at least once if a nonsense mutation or at least twice 
if a missense mutation in the International Agency for Research on Cancer TP53 
database (http://p53.iarc.fr). On the basis of this simulation, we predict that 44% of 
50-year-old individuals harbour one or more HSCs with a functional TP53 mutation. 
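
The logic of this estimate can be illustrated with a toy Monte Carlo model. The sketch below is not the simulation used in this study: it ignores the mutational spectrum and collapses the functional filter into a single hypothetical parameter, `functional_sites`, whose default of 360 was chosen here only so that the toy model lands near the reported value. It assumes a per-position burden of 1.6 × 10⁻⁷ at age 50 and 10,000 HSCs per individual.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def simulate_prevalence(n_individuals, n_hscs=10_000, per_site_burden=1.6e-7,
                        functional_sites=360, seed=0):
    """Fraction of simulated individuals with at least one functionally mutated HSC."""
    rng = random.Random(seed)
    # Expected number of HSCs per individual carrying a functional TP53 mutation
    # (Poisson approximation to many rare, independent events).
    lam = n_hscs * per_site_burden * functional_sites
    carriers = sum(1 for _ in range(n_individuals) if poisson_draw(rng, lam) >= 1)
    return carriers / n_individuals
```

Under these assumptions the expected number of functionally mutated HSCs per individual is about 0.58, and the simulated fraction of carriers converges on 1 − e^(−0.58), or roughly 44%, which is the behaviour expected of a Poisson process.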


25. Mardis, E. R. et al. Recurring mutations found by sequencing an acute myeloid 
leukemia genome. N. Engl. J. Med. 361, 1058-1066 (2009). 

26. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler 
transform. Bioinformatics 25, 1754-1760 (2009). 

27. Li, H. et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics 25, 
2078-2079 (2009). 

28. Larson, D. E. et al. SomaticSniper: identification of somatic point mutations in 
whole genome sequencing data. Bioinformatics 28, 311-317 (2012). 

29. McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for 
analyzing next-generation DNA sequencing data. Genome Res. 20, 1297-1303 
(2010). 

30. Ye, K., Schulz, M. H., Long, Q., Apweiler, R. & Ning, Z. Pindel: a pattern growth 
approach to detect break points of large deletions and medium sized insertions 
from paired-end short reads. Bioinformatics 25, 2865-2871 (2009). 

31. Chen, K. et al. BreakDancer: an algorithm for high-resolution mapping of genomic 
structural variation. Nature Methods 6, 677-681 (2009). 

32. Koboldt, D. C. et al. VarScan 2: somatic mutation and copy number alteration 
discovery in cancer by exome sequencing. Genome Res. 22, 568-576 (2012). 

33. Kinde, I., Wu, J., Papadopoulos, N., Kinzler, K. W. & Vogelstein, B. Detection and 
quantification of rare mutations with massively parallel sequencing. Proc. Natl 
Acad. Sci. USA 108, 9530-9535 (2011). 

34. Schmitt, M. W. et al. Detection of ultra-rare mutations by next-generation 
sequencing. Proc. Natl Acad. Sci. USA 109, 14508-14513 (2012). 

35. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nature 
Methods 9, 357-359 (2012). 

36. Thorvaldsdottir, H., Robinson, J. T. & Mesirov, J. P. Integrative Genomics Viewer 
(IGV): high-performance genomics data visualization and exploration. Brief. 
Bioinform. 14, 178-192 (2013). 

37. Wickham, H. ggplot2: Elegant Graphics for Data Analysis (Springer, 2009). 

38. Bustin, S. A. et al. The MIQE guidelines: minimum information for publication of 
quantitative real-time PCR experiments. Clin. Chem. 55, 611-622 (2009). 

39. Hindson, B. J. et al. High-throughput droplet digital PCR system for absolute 
quantitation of DNA copy number. Anal. Chem. 83, 8604-8610 (2011). 


©2015 Macmillan Publishers Limited. All rights reserved 




Extended Data Figure 1 | Whole-genome sequencing analysis of t-AML. 
a, Somatic copy number alterations in the 22 cases of t-AML. Blue indicates 
copy number loss; red indicates copy number gain. b, Representative clonality 
plots for 8 cases of t-AML are shown. Scatter plots (bottom) show variant 
allele frequency and read depth in the tumour sample. Variant alleles in the 
founding clone are depicted in green, while variants in subclones are depicted in 
orange or purple. Top, kernel density plots of the VAF data (green line) along 
with the posterior predictive densities (grey line) from the mathematical model 
used to segregate clusters. c, Frequency of tier 1 silent, tier 2, and tier 3 
mutations in 1 Mb increments across chromosome 17 in de novo AML and 
t-AML. The TP53 genomic locus is identified. 

Extended Data Figure 2 | TP53 mutations are associated with decreased overall survival in t-AML/t-MDS. a, Overall survival in TP53 mutated (n = 13) and 
TP53 wild-type (n = 39) t-AML patients. b, Overall survival in TP53 mutated (n = 24) and TP53 wild-type (n = 35) t-MDS patients. 

Extended Data Figure 3 | Model of how cytotoxic therapy shapes clonal 
evolution in t-AML/t-MDS. Age-related mutations in HSPCs result in the 
production of a genetically heterogeneous population of HSPCs, including rare 
HSPCs with heterozygous TP53 mutations in some individuals. During 
chemotherapy and/or radiotherapy for the primary cancer, HSPC clones 
harbouring a TP53 mutation have a competitive advantage, resulting in 
expansion of that clone. Subsequent acquisition of additional driver mutations 
results in transformation to t-AML/t-MDS. Of note, the presence of TP53 
mutations probably accounts for the high incidence of cytogenetic 
abnormalities in t-AML/t-MDS and poor response to chemotherapy. 

Extended Data Figure 4 | Validation of the unique adaptor sequencing 
method. a, Unique adaptor sequencing approach. Step 1: genomic DNA is 
amplified with TP53-specific primers (green) with subpopulation-specific 
variant alleles highlighted in red. Step 2: randomly indexed adapters (tan 
and grey) are ligated to each amplicon. Step 3: the indexed amplicons are 
amplified to generate multiple reads possessing the same barcode (that is, read 
families). Step 4: after sequencing, reads are aligned and grouped by read 
families to generate an error-corrected consensus sequence. Sequencing errors 
(yellow) are randomly distributed amongst read families, while true variant 
alleles (red) are present in all members of a given read family. b, A tumour 
sample (UPN 895681) with a known TP53 somatic mutation (chromosome 
17: 7519119 T to A) at a VAF of ~37% was mixed with normal genomic DNA 
sample at the indicated ratio, and conventional (left) or unique adaptor 
next-generation sequencing (middle and right) was performed, as described in 
Methods. DNA degradation with time may result in errors that are then 
amplified during PCR, providing a source of false-positive calls. This is 
particularly true for C to A transversions. Since none of the TP53 mutations 
analysed in this study were C to A transversions, we also analysed the data 
after removing C to A calls (right). The TP53 variant allele is circled in blue. 
c, The threshold of detection for the variant allele with each sequencing method 
is shown. 

Extended Data Figure 5 | Clonal evolution in case 341666. a, Clinical course 
of case 341666. Chemo, chemotherapy; DLBCL, diffuse large B-cell lymphoma; 
XRT, radiotherapy. b, Unique adaptor sequencing was performed on 
genomic DNA derived from leukapheresis samples obtained 3 years before the 
diagnosis of t-MDS for the two clonal mutations present in the diagnostic 
t-MDS sample. Genomic DNA from a patient lacking these variants was used as 
a control. The blue circle indicates the position of the variant SNV. c, Proposed 
model of clonal evolution to t-MDS in this case. 

Extended Data Figure 6 | ddPCR verification of selected somatic TP53 
mutations identified in peripheral blood of cancer-free individuals. 
a-c, ddPCR was performed on genomic DNA isolated from the peripheral 
blood of cancer-free individuals (middle) for whom unique-read adaptor 
sequencing suggested the presence of the indicated TP53 mutation. Controls 
represent peripheral blood DNA from cancer-free elderly individuals with 
VAFs not above background levels for the mutation of interest (right); the 
negative control for TP53 Y220C is shown in Fig. 3b. a, The diagnostic t-AML 
sample from patient 967645 was used as a positive control for TP53 Y220C. 
b, c, For TP53 V173M (b) and TP53 I195T (c) double-stranded genomic blocks 
(gBlocks) were synthesized containing the mutation of interest and mixed 
with gBlocks of wild-type sequence. Droplets containing only the variant TP53 
allele are highlighted in orange, droplets containing the wild-type TP53 allele 
(with or without the variant TP53 allele) are highlighted in blue; empty droplets 
are grey. The number of droplets in each gate is indicated. 

Extended Data Table 1 | Clinical summary of the 22 t-AML whole-genome sequencing cases 


56.5 years (26-80) 
Male 36.4% 
Female 63.6% 


Breast 
Prior Disease Non-Hodgkin's Lymphoma 
Multiple Sclerosis 


Other 
Alkylator 
Known Previous Treatment Topoisomerase inhibitor 
Radiation 


Autologous Transplant 


3.2 years (0.9-13.3) 


Cytogenetics complex 
MLL rearrangement 
non-complex non-MLL 
79% (19-95%) 
Allogeneic transplant 
Most intensive t-AML/t-MDS Myeloablative 
treatment regimen Non-myeloablative 
Other/unknown 
Remission Yes 50% 
No 40.9% 


Overall Survival Median 140.5 days (8-2000) 


Latency is defined as the time from the original cancer diagnosis to the development of t-AML/t-MDS. 

Extended Data Table 2 | Clinical summary of the combined 111 t-AML/t-MDS cases 


Female 46.8% 

Breast 

Prior Disease Non-Hodgkin's Lymphoma 
Hodgkin's Disease 

Other 
Alkylator 55.9% 
Known Previous Treatment Topoisomerase inhibitor 50.5% 
Radiation 63.1% 


Autologous Transplant 21.6% 
Latency Median 6.25 years (0.4-40.7) 


MDS 53.2% 
deletion 5 26.1% 
Cytogenetics deletion 7 28.8% 
complex 45.0% 
MLL rearrangement 5.4% 
other/unknown 41.4% 
% Blasts in the bone marrow Median 13% (0-95%) 
Allogeneic transplant 


Most intensive AML 
treatment Myeloablative 
regimen Non-myeloablative 
Other/unknown 


Yes 49.5% 
No 43.2% 
Overall Survival 414 days (8-3831) 


Latency is defined as the time from the original cancer diagnosis to the development of t-AML/t-MDS. 

Extended Data Table 3 | Previously banked tissue samples in patients with t-AML/t-MDS with clonal TP53 mutations 


TP53 mutation in t-AML/t-MDS Prior banked tissue sample 


Position Year of Year of Prior Banked 
Patient (Chr 17) Mutation Coding change Banking Diagnosis Tissue 
236041 7,518,261 R249W BM FFPE 


341666 7,518,988 R196* Pharesis 


Pharesis 


530447 7,519,238 K139N 
530447 7,518,263 R248Q Pharesis 
648904 7,514,759 Exon 9 splice site Pharesis 
756582 7,519,015 Exon 6 splice site Pharesis 
895681 7,519,119 H179L BM FFPE 
967645 7,518,915 Y220C BM Flow 


All patients had one or more clonal TP53 mutations in their diagnostic t-AML/t-MDS samples (530447 had biallelic mutations). Cases in which the previously banked sample had detectable TP53 mutated cells are 
highlighted in red. See Supplementary Table 1 for the clinical and molecular features of these cases. BM FFPE, formalin-fixed paraffin-embedded sample; BM flow, snap-frozen bone marrow leukocyte pellet. 

Extended Data Table 4 | Somatic TP53 mutations in 19 cancer-free individuals 

Start    Stop     Ref  Amino acid  COSMIC ID    Var count  Total read family count  VAF (read-family)  VAF (ddPCR)
7518230  7518230  -    D259A       none         13         33085                    0.039%             N.D.
7518273  7518273  C    G245S       COSM6932     18         41836                    0.043%             N.D.
7517849  7517849  C    V272M       COSM10891    26         81015                    0.032%             N.D.
7517845  7517845  C    R273H       COSM10660    489        420026                   0.12%              N.D.
7519138  7519138  C    V173M       COSM11084    177        182809                   0.097%             0.081%
7519174  7519174  C    A161T       COSM10739    25         164591                   0.015%             N.D.
7520035  7520035  A    SPLICING    COSM1522474  23         165672                   0.014%             N.D.
7517934  7517934  C    INTRONIC    none         36         333996                   0.011%             N.D.
7518990  7518990  A    I195T       COSM11089    57         15540                    0.37%              0.28%
7518915  7518915  T    Y220C       COSM10758    91         316765                   0.029%             0.029%
7517819  7517819  G    R282W       COSM10704    51         86090                    0.059%             N.D.
7518264  7518264  G    R248G       COSM11564    -          218077                   0.11%              N.D.
7518264  7518264  G    R248W       COSM10656    -          51001                    0.37%              N.D.


Coverage statistics are as follows. In the amplicon targeting exon 4, 17/19 subjects had >10,000 coverage in 100% of the amplicon. In the amplicon targeting exon 5, 17/19 subjects had >10,000 coverage in 
100% of the amplicon. In the amplicon targeting exon 6, 5/19 subjects had > 10,000 coverage in 100% of the amplicon, and 11/19 subjects had >10,000 coverage in at least 75% of the amplicon. In the amplicon 
targeting exon 7, 17/19 subjects had >10,000 coverage in 100% of the amplicon. In the amplicon targeting exon 8, 18/19 subjects had >10,000 coverage in 100% of the amplicon. See Supplementary Table 11 
for the primers used to make the amplicons from genomic DNA. N.D., not determined. 






doi:10.1038/nature13994 


Enhancer–core-promoter specificity separates 
developmental and housekeeping gene regulation 


Muhammad A. Zabidi1*, Cosmas D. Arnold1*, Katharina Schernhuber1, Michaela Pagani1, Martina Rath1, Olga Frank1 
& Alexander Stark1 


Gene transcription in animals involves the assembly of RNA poly- 
merase II at core promoters and its cell-type-specific activation by 
enhancers that can be located more distally. However, how ubiquitous 
expression of housekeeping genes is achieved has been less clear. 
In particular, it is unknown whether ubiquitously active enhancers 
exist and how developmental and housekeeping gene regulation 
is separated. An attractive hypothesis is that different core promoters 
might exhibit an intrinsic specificity to certain enhancers. 
This is conceivable, as various core promoter sequence elements are 
differentially distributed between genes of different functions, including 
elements that are predominantly found at either developmentally 
regulated or at housekeeping genes. Here we show that 
thousands of enhancers in Drosophila melanogaster S2 and ovarian 
somatic cells (OSCs) exhibit a marked specificity to one of two core 
promoters—one derived from a ubiquitously expressed ribosomal 
protein gene and another from a developmentally regulated tran- 
scription factor—and confirm the existence of these two classes for 
five additional core promoters from genes with diverse functions. 
Housekeeping enhancers are active across the two cell types, while 
developmental enhancers exhibit strong cell-type specificity. Both 
enhancer classes differ in their genomic distribution, the functions 
of neighbouring genes, and the core promoter elements of these 
neighbouring genes. In addition, we identify two transcription fac- 
tors—Dref and Trl—that bind and activate housekeeping versus 
developmental enhancers, respectively. Our results provide evidence 
for a sequence-encoded enhancer-core-promoter specificity that sep- 
arates developmental and housekeeping gene regulatory programs 
for thousands of enhancers and their target genes across the entire 
genome. 

We chose the core promoter of Ribosomal protein gene 12 (RpS12) and 
a synthetic core promoter derived from the even skipped transcription 
factor as representative ‘housekeeping’ and ‘developmental’ core promoters, 
respectively (hereafter termed hkCP and dCP; Fig. 1a and Extended 
Data Figs 1, 2) and tested the ability of all candidate enhancers 
genome wide to activate transcription from these core promoters using 
self-transcribing active regulatory region sequencing (STARR-seq) in 
D. melanogaster S2 cells. This set-up allows the testing of all candidates 
in a defined sequence environment, which differs only in the core promoter 
sequences but is otherwise constant. 

Two hkCP STARR-seq replicates were highly similar (genome-wide 
Pearson correlation coefficient (PCC) 0.98; Extended Data Fig. 1c) and 
yielded 5,956 enhancers, compared with 5,408 enhancers obtained when 
we reanalysed dCP STARR-seq data (Supplementary Table 1). Interestingly, 
the hkCP and dCP enhancers were largely non-overlapping 
(Fig. 1b, c) and the genome-wide enhancer activity profiles differed 
(PCC 0.38), as did the individual enhancer strengths: of the 11,364 en- 
hancers, 8,144 (72%) activated one core promoter at least twofold more 
strongly than the other, a difference rarely seen in the replicate experi- 
ments for each of the core promoters (Fig. 1d). Indeed, 21 out of 24 
hkCP-specific enhancers activated luciferase expression (>1.5-fold 
and t-test P < 0.05) from the hkCP versus 1 out of 24 from the dCP 
(Fig. 1e and Extended Data Fig. 3). Consistently, 10 out of 12 dCP-specific 
enhancers were positive with the dCP but only 2 out of 12 with 
the hkCP, a highly significant difference (P = 5.1 × 10⁻⁶, Fisher’s exact 
test) that confirms the enhancer-core-promoter specificity observed 
for thousands of enhancers across the entire genome. 
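
As a worked check of this comparison, the sketch below computes a two-sided Fisher's exact test from the hypergeometric distribution. How the counts were arranged for the published test is not stated; the 2 × 2 table used in the usage note (positive assays per enhancer class and core promoter) is an assumption, and the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def pmf(x):
        # Hypergeometric probability of x in the top-left cell given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = pmf(a)
    # Sum all tables that are no more probable than the observed one.
    return sum(pmf(x) for x in range(lowest, highest + 1)
               if pmf(x) <= p_observed * (1 + 1e-9))
```

Taking rows as hkCP-specific and dCP-specific enhancers and columns as the number of enhancers positive with the hkCP and with the dCP, `fisher_exact_two_sided(21, 1, 2, 10)` gives approximately 5.1 × 10⁻⁶.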

Enhancers that were specific to either the hkCP or the dCP showed 
markedly different genomic distributions (Fig. 2a and Extended Data 


Figure 1 | Distinct sets of enhancers activate 
transcription from the hkCP and dCP in S2 cells. 
a, STARR-seq set-up using the hkCP housekeeping 
(RpS12; purple) and dCP developmental core 
promoters (Drosophila synthetic core promoter 
(DSCP); brown). b, Genome browser screenshot 

Figure 2 | hkCP and dCP enhancers differ in genomic distribution and 
flanking genes. a, Genomic distribution of hkCP and dCP enhancers. CDS, 
coding sequence; UTR, untranslated region. b, c, hkCP enhancers function 
distally in luciferase assays independent of their genomic positions (b) and 
orientation towards the luciferase TSS (c; orientation 1 from b; Extended 
Data Figs 3 and 5). d, e, GO (5 of the top 100 terms shown per column; 
Supplementary Table 11) and gene expression (terms curated from the 
Berkeley Drosophila Genome Project (BDGP) and FlyAtlas) analyses (d) and 
enrichment of core promoter elements at TSSs (e) for genes next to hkCP 
and dCP enhancers. TF, transcription factor. 


Fig. 4): whereas the majority (58.4%) of hkCP-specific enhancers over- 
lapped with a transcription start site (TSS) or were proximal to a TSS 
(≤200 bp upstream; Fig. 2a), dCP-specific enhancers located predominantly 
to introns (56.5%) and intergenic regions (26.9%; Fig. 2a). 
Importantly, despite the TSS-proximal location of most hkCP-specific 
enhancers, they activated transcription from a distal core promoter in 
STARR-seq (Fig. 1a and Extended Data Figs 1a, 2). Luciferase assays 
confirmed that they function from a distal position (>2 kb from the TSS) 
downstream of the luciferase gene and independently of their orientation 
towards the luciferase TSS (Fig. 2b, c and Extended Data Figs 3, 5). 
These results show that TSS-proximal sequences can act as bona fide 
enhancers and that developmental and housekeeping genes are both 
regulated through core promoters and enhancers, yet with a substan- 
tially different fraction of TSS-proximal enhancers (3.4% versus 58.4%). 

hkCP and dCP enhancers were also located next to functionally dis- 
tinct classes of genes according to gene ontology (GO) analyses: genes 
next to hkCP enhancers were enriched in diverse housekeeping functions 
including metabolism, RNA processing and the cell cycle, whereas genes 
next to dCP enhancers were enriched for terms associated with devel- 
opmental regulation and cell-type-specific functions (Fig. 2d, Extended 
Data Fig. 6a and Supplementary Tables 2-4). Consistently, hkCP en- 
hancers were preferentially near ubiquitously expressed genes and dCP 
enhancers were near genes with tissue-specific expression (Fig. 2d and 
Supplementary Table 5). 

The core promoters of the putative endogenous target genes of hkCP 
and dCP enhancers were also differentially enriched in known core promoter 
elements (Fig. 2e and Extended Data Fig. 6b): TSSs next to hkCP 
enhancers were enriched in Ohler motifs 1, 5, 6 and 7, consistent with 
the ubiquitous expression and housekeeping functions of these genes. 
In contrast, TSSs next to dCP enhancers were enriched in TATA box, 
initiator (Inr), motif ten element (MTE) and downstream promoter 
element (DPE) motifs, which are associated with cell-type-specific gene 
expression. 

We next investigated whether the specificity that hkCP and dCP show 
to the two enhancer classes applies more generally. We selected three 
additional core promoters from housekeeping genes with different func- 
tions: from the eukaryotic translation elongation factor 1δ (eEF1δ), the 
putative splicing factor x16, and the cohesin loader Nipped-B (NipB). 
Importantly, all three contained combinations of core promoter 
elements that differed from that of hkCP, namely TCT? and DNA- 
replication-related element (DRE) motifs (eEF1δ), and Ohler motifs 1 
and 6 (x16 and NipB; Fig. 3a). In addition, we selected a DPE-containing 
core promoter of the transcription factor pannier (pnr) and the TATA- 
box core promoter of Heat shock protein 70 (Hsp70), which can be 
activated by tissue-specific enhancers (for example, see ref. 17), thus 
covering the two most prominent core promoter types of regulated 
genes?'°"8, 

We performed STARR-seq for the five additional core promoters and 
grouped the genome-wide enhancer activity profiles of all seven core 
promoters by hierarchical clustering. This revealed two distinct clus- 
ters corresponding to the four housekeeping and the three develop- 
mental core promoters, respectively (Fig. 3b, Extended Data Fig. 7 and 
Supplementary Tables 6, 7), and the core promoters of both clusters indeed 
responded markedly differentially to individual genomic enhancers 
(Fig. 3c). 

These results obtained for core promoters with diverse motif content 
and from genes with various functions suggest that the distinct enhan- 
cer preferences observed between hkCP and dCP apply more generally 
and that two broad classes of housekeeping and developmental (or 
regulated) core promoters exist. Differences within each class might 
correspond to differences in relative enhancer preferences of the core 
promoters” °, while similarities between both classes could reflect en- 
hancers that are shared (Fig. 1c—e) or core promoters that can be acti- 
vated to different extents by enhancers from both classes (for example, 


Figure 3 | Housekeeping and developmental 
core promoters differ characteristically in their 
enhancer preferences. a, Different housekeeping 
(top 4) and developmental-like (bottom 3) core 
promoters and their motif content (schematic). 
b, Bi-clustered heat map depicting pairwise 
similarities of STARR-seq signals (PCCs at peak 
summits). PCCs and dendrogram (top) show 
the separation between housekeeping and 
regulated core promoters. c, Genome browser 
screenshot depicting STARR-seq tracks for 
all seven core promoters. 
housekeeping genes need to be further activated in specific tissues. 


To test whether hkCP enhancers function in different cell types, we 
performed STARR-seq using hkCP in OSCs, which differ strongly from 
S2 cells in gene expression and dCP enhancer activities’*. Two hkCP 
STARR-seq replicates in OSCs were highly similar (PCC 0.97) and yielded 
6,217 enhancers (Supplementary Table 1), compared with 5,774 en- 
hancers obtained for dCP data from OSCs'*. The OSC data confirmed 
the differences between hkCP and dCP enhancers observed in S2 cells 
(Extended Data Figs 8, 9 and Supplementary Tables 8-10). Strikingly, 
Figure 4 | hkCP enhancers are shared across cell 
types. a, Genome browser screenshot showing 
tracks for hkCP (top) and dCP STARR-seq 
(bottom) in S2 cells and OSCs. b, Overlap of hkCP 
(top) and dCP (bottom) enhancers between S2 cells 
and OSCs. c, d, hkCP (c) and dCP (d) STARR-seq 
enrichments in S2 cells versus OSCs at hkCP- or 
dCP-specific enhancers (insets show enrichments 
for replicates (Enr. rep) 1 versus 2; dCP data 
reanalysed from ref. 12). 
cers (2,909 in OSCs and 3,586 in S2 cells) differed strongly between the 
two cell types’? and from the hkCP enhancers (Fig. 4a). The observation 
that hkCP enhancers showed similar activities in both cell types while 
dCP enhancers were cell-type specific was true genome wide when 
comparing genomic locations (69% versus 15% overlap) or enhancer 
strengths as measured by STARR-seq (PCC at peak summits 0.83 ver- 
sus 0.05; Fig. 4b-d and Extended Data Fig. 9c). Together, these results 
show that hkCP enhancers are shared between two different cell types, 


Figure 5 | hkCP and dCP enhancers depend on 
Dref and Trl, respectively. a, b, Motif enrichment 
(a) and ChIP signals for Dref and Trl (b) in hkCP 
and dCP enhancers. False discovery rate (FDR)- 
corrected hypergeometric P < 0.01; boxes: median 
and interquartile range; whiskers: 5th and 95th 
percentiles; two-sided Wilcoxon-rank-sum P 
values. NS, not significant. c, Luciferase assays for 
four wild-type and DRE-motif-mutant hkCP 
enhancers (numbers show mutated motifs). Error 
bars show standard deviation (s.d.) (n = 3, 
biological replicates). *P < 0.005 (one-sided t-test). 
d, Luciferase assays for two dCP enhancers (−) and 
their GAGA → DRE-mutant variants (+) with 
hkCP (top) and dCP (bottom; details as in c). 

e, Luciferase assays for an array of DRE motifs with 
hkCP and dCP (details as in c). f, Model: 
housekeeping genes contain Ohler motifs 1, 5, 6, 7 
and/or the TCT motif and are activated by TSS- 
proximal hkCP enhancers via Dref. Regulated 
genes contain TATA box, Inr, MTE and/or DPE 
and are activated by distal dCP enhancers via Trl. 


whereas dCP enhancers are cell-type specific’, presumably represent- 
ing ubiquitous housekeeping versus developmental and cell-type- 
specific gene expression programs. 

To assess whether the marked core promoter specificities of the hkCP 
and dCP enhancers are encoded in their sequences, we analysed the cis- 
regulatory motif content of both classes of enhancers’’. This revealed a 
strong enrichment of the DRE motif in hkCP enhancers (Fig. 5a and 
Supplementary Tables 11, 12), whereas dCP enhancers were strongly 
enriched in the GAGA motif of Trithorax-like (Trl) and other motifs 
previously described to be important for dCP enhancers”. Published 
genome-wide chromatin immunoprecipitation (ChIP) data*’** con- 
firmed that DRE-binding factor (Dref) bound significantly more strongly 
to hkCP enhancers than to dCP enhancers (Wilcoxon P = 0; Fig. 5b), 
while the opposite was true for Trl (Wilcoxon P = 6.2 X 1071”). Consid- 
ering only distal enhancers (>500 bp from the closest TSS) yielded the 
same results (Extended Data Fig. 10a, b and Supplementary Tables 13, 14), 
suggesting that the differential occupancy is a property of both classes 
of enhancers rather than a consequence of the different extents to which 
they overlap with TSSs. Disrupting the DRE motifs in four different 
hkCP enhancers substantially reduced the activities of the enhancers 
as measured by luciferase assays in S2 cells (between 2.3- and 24.5-fold 
reduction; Fig. 5c), while dCP enhancers depend on GAGA motifs”. 
Adding DRE motifs to 11 different dCP enhancers significantly increased 
luciferase expression from the hkCP for 9 of them (82%; Extended Data 
Fig. 10c), and changing the GAGA motifs of two dCP enhancers to DRE 
motifs significantly increased the activities of both enhancers towards 
the hkCP but decreased their activities towards the dCP (Fig. 5d). Fur- 
thermore, an array of six DRE motifs was sufficient to activate lucifer- 
ase expression from the hkCP but not the dCP (Fig. 5e). Together, these 
results show that hkCP and dCP enhancers depend on DRE and GAGA 
motifs, respectively, and demonstrate that DRE motifs are required and 
sufficient for hkCP enhancer function. 

Our results show that developmental and housekeeping gene regu- 
lation is separated genome wide by sequence-encoded specificities of 
thousands of enhancers to one of two types of core promoter, supporting 
the longstanding ‘enhancer–core-promoter specificity’ hypothesis” °”’. 
Our findings indicate that these specificities are probably mediated by 
defined biochemical compatibilities* between different trans-acting 
factors such as Dref versus Trl (at enhancers) and the different para- 
logues that exist for several components of the general transcription 
apparatus (at core promoters), presumably including the TATA-box- 
binding protein-related factor 2 (Trf2) at housekeeping core promoters””*. 
As such paralogues can have tissue-specific expression and stage-specific 
or promoter-selective functions’””* (reviewed in refs 29, 30), sequence- 
encoded enhancer-core-promoter specificities could be used more widely 
to define and separate different transcriptional programs (Fig. 5f). 


Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper. 
METHODS 

hkCP STARR-seq vector. We derived the hkCP STARR-seq vector from the orig- 
inal STARR-seq vector” by replacing the DSCP sequence with the sequence of the 
RpS12 core promoter (−50 to +50 bp relative to the TSS; TTGTACCAATAGCT 
AAAAACTCACATCTCCAGCGCCATGCCGATTTTGTTCTCTTTCTTTCCG 
GTTGTCAAAAGGTACAGATGCTTGGATTTTATTTCTC). The STARR-seq 
vectors are available subject to a material transfer agreement (MTA). For both 
STARR-seq vectors, we confirmed that transcription initiates from within the re- 
spective core promoters’ Inr (DSCP) and TCT (RpS12) motifs by 5’ rapid amplifi- 
cation of cDNA ends (RACE; Extended Data Fig. 2). All other STARR-seq vectors 
were derived from the hkCP STARR-seq vector by replacing the 100 bp sequence 
encompassing the RpS12 core promoter by the sequences indicated in Supplemen- 
tary Table 15 using the BglII and SbfI restriction sites. 

hkCP and dCP luciferase vectors. For the dCP luciferase vector, the SV40 promoter 
of the pGL3-Promoter Vector (Promega) was replaced by the DSCP”' and a 
Gateway cassette was inserted downstream of the luciferase gene and the SV40 
polyA signal into the AfeI restriction site, to allow Gateway LR cloning of candidate 
sequences’’. For the hkCP luciferase vector, the SV40 promoter and the sequence 
up to the translation start codon of the luciferase gene were replaced by the 
sequence encompassing the TSS of RpS12 from −50 bp until its translation start 
codon: TTGTACCAATAGCTAAAAACTCACATCTCCAGCGCCATGCCGA 
TTTTGTTCTCTTTCTTTCCGGTTGTCAAAAGGTACAGATGCTTGGATTT 
TATTTCTCCGAAATGAAGAGGTTTTCTTATCGAAAATGTAATAAATATG 
AACAATTAACTATCTTTTCCAGTGCAGTGCATCCTTAACCGCAGAACA. 
Constructs are available subject to an MTA. 

Intrinsic activity of core promoters. All core promoters used in this study were 
cloned into the dCP luciferase vector (without the Gateway cassette), replacing the 
DSCP between the BglII and SbfI restriction sites with the respective core promoter. 
For each core promoter, the intrinsic (or basal) activity was measured as firefly lu- 
ciferase activity and is presented as relative luciferase units, normalized to Renilla 
luciferase signals. 

Genome-wide STARR-seq screens. STARR-seq enhancer screens using the core 
promoters of RpS12 (hkCP), NipB, x16, and eEF1δ (Supplementary Table 15) were 
performed in two biological replicates (independent transfections) as described 
previously”? with the following exceptions. First, 1.6 X 10” S2 cells and OSCs*" were 
transfected per biological replicate. Second, first-strand cDNA synthesis was per- 
formed in 30-60 reactions with the STARR-seq RT primer (CTCATCAATGTAT 
CTTATCATGTCTG) as reverse transcription primer. Last, next-generation sequencing 
(NGS) was performed on an Illumina HiSeq 2000 machine using multiplexing 
according to the manufacturer’s instructions. STARR-seq data using the DSCP (dCP 
STARR-seq) and Hsp70 core promoters are from ref. 12, but were reanalysed using 
the same pipeline as for hkCP STARR-seq. 

Focused STARR-seq BAC screens. The DSCP is a 137-nucleotide-long synthetic 
core promoter derived from the core promoter of even skipped (eve)''. To assess 
the functional similarity of the DSCP, its 137-nucleotide-long wild-type counterpart 
from the eve locus, and a version defined identically to all other core promoters 
used here (−50 to +50 nucleotides around the TSS), we performed STARR-seq 
screens with libraries derived from 29 different BACs containing a total of ~5 Mb 
of D. melanogaster genomic DNA (Supplementary Table 16). For comparison, we 
also screened all other core promoters with this library. For library cloning, all BACs 
were grown in individual bacterial cultures and were then mixed equally according 
to measurements of their optical density at 600 nm (OD600 nm) before BAC DNA 
isolation to achieve an equal distribution of all BACs. BAC DNA extraction, soni- 
cation and adaptor ligation was performed as described’* and the same adaptor- 
ligated and PCR-amplified BAC DNA was used to clone all focused STARR-seq 
libraries. Per STARR-seq vector, four In-Fusion reactions were performed, which 
allowed five transformation reactions as described’’. Each library was grown in 4 l 
liquid culture (LB medium) to an OD600 nm of 2.0–2.5. Each BAC library was screened 
as described earlier for the genome-wide screens; however, only 1 X 10° S2 cells were 
used, accounting for the less complex library. Similarly, the number of reactions for 
all subsequent steps of the STARR-seq protocol was reduced fourfold. 
Luciferase reporter assays. Luciferase assays were performed as described prev- 
iously’” with the exception that the candidate enhancers were cloned downstream 
of the luciferase gene and the polyA signal, more than 2 kb away from the respect- 
ive core promoter (RpS12 or DSCP). Candidate enhancers were selected manually 
based on different criteria to allow the systematic assessment of several aspects of 
this study, including enhancers that were (1) specific to one of the two different 
core promoters (24 hkCP and 12 dCP enhancers) or found in both screens (7 shared 
enhancers); (2) located proximally (17) or distally (7) to the hkCP; and (3) of 
different strengths according to STARR-seq (ranks 18 to 1,044). We cloned all 
candidates as described’? (for their genomic coordinates and primer sequences see 
Supplementary Table 17), picking initially one orientation towards the luciferase 
TSS randomly. However, to test the influence of TSSs contained in the candidate 


sequences, we cloned and tested all TSS proximal candidates (hkCP_01 to hkCP_ 
17) in both orientations using both core promoters. Candidate enhancers with 
DRE mutations were cloned from synthesized DNA fragments (GeneArt Strings; 
Supplementary Table 18). Candidates with DRE motifs that replace GAGA motifs 
were cloned similarly using synthesized DNA fragments (gBlocks) obtained from 
Integrated DNA Technologies (Supplementary Table 19). We also added an array 
of 6× DRE motifs into the AfeI restriction site of the dCP and hkCP luciferase 
vectors and cloned dCP_01 to dCP_11 into the middle of the DRE motif array 
(using AfeI) of the hkCP luciferase vector, such that these sequences were each 
flanked by three DRE motifs (Supplementary Table 19). 

Luciferase assay data analysis. For all luciferase assays, we calculated standard 
deviations and one-sided Student's t-tests from three biological replicates (indepen- 
dent transfections). Core promoters have intrinsic (basal) activities that can differ 
between different core promoters. Therefore, when comparing enhancer activities 
for different core promoters, normalization to the core promoters’ intrinsic activ- 
ities is required, which we assessed with three different negative control fragments 
(nine biological replicates in total). For all measurements, we normalized firefly 
luciferase values first to Renilla luciferase values (controlling for transfection effi- 
ciency) and then to the normalized luciferase values of the three negative control 
sequences. Candidates with a significant (P < 0.05) enrichment greater than 1.5-fold 
over the negative controls were considered positive. 
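
The two-step normalization and the positive call described above can be sketched as follows. This is an illustration only, not the authors' analysis code: the function names are hypothetical, and the P value is taken as an input because the one-sided t-test itself is not reimplemented here.

```python
from statistics import mean

def normalized_activity(firefly, renilla, neg_ctrl_ratios):
    """Normalize firefly to Renilla (transfection efficiency), then to
    the mean firefly/Renilla ratio of the negative control fragments."""
    ratio = firefly / renilla
    return ratio / mean(neg_ctrl_ratios)

def is_positive(fold_values, p_value, min_fold=1.5, alpha=0.05):
    """A candidate counts as positive if its mean fold over the negative
    controls exceeds min_fold and the (externally computed) P value is
    below alpha. Both thresholds follow the text above."""
    return mean(fold_values) > min_fold and p_value < alpha
```

Normalizing to the negative controls of the same core promoter is what makes enhancer activities comparable between hkCP and dCP, whose basal activities differ.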

5’ RACE of STARR-seq transcripts. To determine the exact TSSs of hkCP and 
dCP within the STARR-seq vectors we performed 5’ RACE of STARR-seq tran- 
scripts using one enhancer for each (an intergenic enhancer of TpnC41C for hkCP 
and an intronic enhancer of zfh1 (shared_01 from ref. 12) for dCP) which we 
cloned with EcoRV at the position of the selection cassette used during library clon- 
ing (Supplementary Table 20). We transfected 3.2 X 10’ cells with each of the con- 
structs and isolated total RNA using the RNeasy mini prep kit (Qiagen; two columns 
per construct) followed by polyA+ RNA isolation using oligo-dT Dynabeads (Life 
Technologies) according to the manufacturer’s instructions. We then performed 
5' RACE for both samples using the FirstChoice RLM-RACE Kit (Ambion; cata- 
logue no. AM1700) according to the manufacturer’s instructions. To reflect RNA 
processing of the STARR-seq pipeline, reverse transcription was, however, performed 
using SuperScript III (Invitrogen) according to the manufacturer’s instructions and 
using the reverse transcription primer GFP-RT (Supplementary Table 20) as a 
gene-specific primer (using RNA amounts according to the FirstChoice manual). 
The first PCR was performed with the manufacturer-provided 5’ RACE Outer Primer 
and the transcript-specific primer RACE-01-rv, using 2× KAPA HiFi HotStart 
ReadyMix (98 °C for 45 s; followed by 35 cycles of 98 °C for 15 s, 69 °C for 30 s, 
72 °C for 30 s) with 1 µl of cDNA as template. The nested PCR was performed 
similarly (primers: 5’ RACE Inner Primer and RACE-02-rv; 98 °C for 45 s; followed 
by 30 cycles of 98 °C for 15 s, 67 °C for 30 s, 72 °C for 10 s). The PCR products were 
visualized on a 1% agarose gel. The PCR products for both samples were Sanger 
sequenced using the primer GFP-seq-rv (for all primer sequences see Supplemen- 
tary Table 20). 

STARR-seq NGS data processing. Paired-end STARR-seq and input read pro- 
cessing was performed as described*’. The NGS data for dCP (DSCP) and Hsp70 
were obtained from ref. 12 and reanalysed. In the same cell line, a hkCP peak is 
considered to be ‘specific’ if the 501 bp window centred at the peak summit does 
not overlap with any such window for dCP peaks, and vice versa (note that this is 
only applied within each cell type, such that comparisons across cell types are not 
influenced). For screens with the BAC-derived libraries, we considered only frag- 
ments that originated from the BACs used and determined the relative abundance 
of each BAC from the NGS data of the respective inputs only. On the basis of this, 
we then adjusted both inputs and STARR-seq NGS data such that all BACs were 
equally represented and analysed the data as described earlier. 
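
The ‘specific’ peak definition can be illustrated with a short sketch. It assumes summits are integer coordinates on a single chromosome within one cell type, and the function name is hypothetical. Two 501 bp windows (summit ± 250 bp) overlap exactly when their summits are at most 500 bp apart.

```python
from bisect import bisect_left

HALF_WINDOW = 250  # 501 bp window = summit +/- 250 bp

def specific_peaks(summits_a, summits_b):
    """Return the summits in A whose 501 bp window overlaps no
    501 bp window centred on a summit in B (same chromosome)."""
    b_sorted = sorted(summits_b)
    max_gap = 2 * HALF_WINDOW  # windows overlap iff |a - b| <= 500
    specific = []
    for s in summits_a:
        # First B summit that could still overlap the window around s.
        i = bisect_left(b_sorted, s - max_gap)
        if i == len(b_sorted) or b_sorted[i] > s + max_gap:
            specific.append(s)
    return specific
```

Calling the function once with hkCP summits as A and once with dCP summits as A gives the two specific sets ("and vice versa" in the text).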

Venn diagrams and peak intersection. We used the same intersection method as 
described earlier, and plotted the Venn diagrams with areas proportional to the 
number of peaks. 

Scatter plots. We calculated the STARR-seq enrichment over input at the summit 
positions of both data sets that were to be compared, using a pseudo count of 1, and 
computed the log2 of the corrected ratio as described’’. This plots one data point for 
each enhancer (even for closely spaced ones) exactly at the enhancer’s summit 
position. For visualizing replicates, we called peaks on the merged data sets and 
plotted the values from both replicates at these peaks’ summits. 
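
A minimal sketch of an enrichment value at a peak summit with a pseudo count of 1 is given below. It assumes a simple library-size normalization; the exact correction applied in the cited reference is not reproduced here, and the function and parameter names are hypothetical.

```python
import math

def log2_enrichment(starr_cov, input_cov, starr_total, input_total, pseudo=1):
    """log2 of STARR-seq over input fragment coverage at one summit.
    A pseudo count is added to both coverages, and each is scaled by
    its library size (an assumed normalization)."""
    starr = (starr_cov + pseudo) / starr_total
    inp = (input_cov + pseudo) / input_total
    return math.log2(starr / inp)
```

The pseudo count keeps the ratio finite at summits where the input coverage is zero.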
Enhancer-to-gene assignment. We performed three different strategies of enhancer- 
to-gene assignments: (1) ‘closest TSS’, whereby an enhancer is assigned to the closest 
TSS of an annotated transcript; (2) ‘1 kb TSS’, whereby an enhancer is assigned to 
all TSSs that are within 1 kb; and (3) ‘gene loci’, whereby an enhancer is assigned 
to a gene provided that it falls within 5 kb upstream from the TSS, within the gene 
body itself, or 2 kb downstream of the gene (multiple assigned genes are possible). 
In all cases we used annotation from D. melanogaster FlyBase release 5.50. 
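
The three assignment strategies can be sketched as below. This is a simplified illustration under stated assumptions: every gene has a single TSS (the text assigns to TSSs of annotated transcripts), all genes lie on one chromosome, and the `Gene` class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # leftmost coordinate
    end: int     # rightmost coordinate
    strand: str  # "+" or "-"

    @property
    def tss(self):
        return self.start if self.strand == "+" else self.end

def closest_tss(summit, genes):
    """Strategy 1: the single gene whose TSS is closest."""
    return min(genes, key=lambda g: abs(g.tss - summit)).name

def within_1kb(summit, genes):
    """Strategy 2: all genes with a TSS within 1 kb."""
    return [g.name for g in genes if abs(g.tss - summit) <= 1000]

def gene_loci(summit, genes, up=5000, down=2000):
    """Strategy 3: 5 kb upstream of the TSS, the gene body,
    or 2 kb downstream of the gene (strand-aware)."""
    hits = []
    for g in genes:
        if g.strand == "+":
            lo, hi = g.start - up, g.end + down
        else:
            lo, hi = g.start - down, g.end + up
        if lo <= summit <= hi:
            hits.append(g.name)
    return hits
```

Strategies 2 and 3 return lists because one enhancer can be assigned to several genes, or to none.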
Genomic distribution. We assigned a unique annotation for each nucleotide in the 
genome by using the following priority order: coding sequence (CDS), core promoter 
(±50 bp around TSS), 5’ UTR, 3’ UTR, first intron, intron, proximal promoter 
(200 bp upstream of a TSS), intergenic region. We then assigned each peak 
to one of these categories by the annotation of the peak’s summit. 
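
The priority-based annotation of a peak summit can be sketched as follows. The feature intervals are assumed to be precomputed for one chromosome as (category, start, end) tuples with inclusive coordinates; the function name and input layout are hypothetical.

```python
# Priority order taken from the text; 'intergenic' is the fallback.
PRIORITY = [
    "CDS", "core promoter", "5' UTR", "3' UTR",
    "first intron", "intron", "proximal promoter",
]

def annotate(summit, features):
    """Return the highest-priority category covering the summit."""
    covering = {cat for cat, start, end in features if start <= summit <= end}
    for cat in PRIORITY:
        if cat in covering:
            return cat
    return "intergenic"
```

Because only the summit position is annotated, each peak falls into exactly one category even when it spans several features.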

GO analysis. We assessed whether genes assigned to hkCP or dCP enhancers were 
enriched for particular GO categories”* by calculating hypergeometric P values for 
all categories, which we corrected for multiple comparisons (FDR-type correction 
in R). We then sorted all categories according to P values of overrepresentation, 
selected the top 100 of either hkCP or dCP, and removed redundant categories 
manually. For each category, we calculated log10(P-value underrepresentation) − 
log10(P-value overrepresentation), and sorted the terms in a descending order of 
difference between hkCP and dCP values. The colour intensity of the heat maps represents 
log10(P-value underrepresentation) − log10(P-value overrepresentation). 
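
The enrichment statistics described above can be sketched in Python (the analysis itself was done in R). Two points are assumptions: the ‘FDR-type correction’ is taken to be the Benjamini-Hochberg procedure, and all function names are hypothetical. Here N is the number of background genes, K the genes in a GO category, n the genes assigned to enhancers and k their overlap.

```python
from math import comb, log10

def p_over(k, N, K, n):
    """Hypergeometric P(X >= k): over-representation."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

def p_under(k, N, K, n):
    """Hypergeometric P(X <= k): under-representation."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(max(0, n - (N - K)), min(k, K, n) + 1)) / total

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running = 1.0
    for rank in range(m, 0, -1):  # from the largest P value down
        i = order[rank - 1]
        running = min(running, pvals[i] * m / rank)
        adjusted[i] = running
    return adjusted

def heat_value(p_under_value, p_over_value):
    """Heat map value: log10(P under) - log10(P over); both must be > 0."""
    return log10(p_under_value) - log10(p_over_value)
```

A strongly over-represented category has a small over-representation P value and therefore a large positive heat map value; under-represented categories come out negative.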
Gene expression analysis. We analysed enrichment in ubiquitous versus tissue- 
specific gene expression sets as described for the GO analysis above. To define the 
gene sets based on an in situ hybridization data set of fly embryos (BDGP”*), we first 
removed maternal (stages 1 to 3) annotations, as well as genes with the annotation 
‘no staining’ in all stages. We required each gene to have annotations for at least 
three stage groupings. We called a gene ‘tissue specific’ if at most one of these annotations 
contains the word ‘ubiquitous’, and called it ‘ubiquitous’ if at least 60% of 
them contain the word ‘ubiquitous’. We also defined gene sets based on microarray 
data sets from dissected fly tissues (FlyAtlas*’). We defined genes as ‘ubiquitous’ if 
their expression does not change more than twofold compared with the whole fly 
for at least 15 out of 23 tissues. For this, we used the ratios and ‘change_direction’ 
calls from FlyAtlas directly and did not consider cell lines and carcasses. We similarly 
defined genes to be ‘tissue specific’ if they change more than twofold in at least 
three tissues. We do not consider genes with multiple conflicting entries as they can 
result from the use of multiple probes and removed genes that overlapped between 
the ‘ubiquitous’ and ‘tissue-specific’ gene sets from both sets. 
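
The in situ-based classification rule can be sketched as below. The input is assumed to be one annotation string per stage grouping for a single gene, with maternal stages and ‘no staining’-only genes already removed; the function name and input format are hypothetical and do not reflect the actual BDGP file layout.

```python
def classify_in_situ(stage_annotations):
    """Classify one gene from its per-stage-grouping annotations.
    Returns 'tissue specific', 'ubiquitous' or None (unclassified)."""
    if len(stage_annotations) < 3:
        return None  # at least three stage groupings are required
    n_ubiquitous = sum("ubiquitous" in a for a in stage_annotations)
    if n_ubiquitous <= 1:
        return "tissue specific"
    if n_ubiquitous / len(stage_annotations) >= 0.6:
        return "ubiquitous"
    return None
```

With at least three annotations the two rules cannot both apply, so genes between the thresholds simply remain unclassified.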

Transcription factor motif and core promoter element enrichment analysis. 
We used previously employed position weight matrices (PWMs) for different transcription 
factors! with a cut-off of 4⁻⁶ = 2.4 × 10⁻⁴. We selected random control 
regions by controlling for genomic and chromosome distribution, and required 
that they did not overlap with any peak. We scored each motif for its enrichment in 
401 bp windows centred on the peak summits by multiple testing (FDR) corrected 
hypergeometric P values. We considered only motifs that showed log2(confidence 
ratio of motif counts in peak windows/motif counts in random control regions) > 1 
and P value < 0.01 in hkCP or dCP enhancers (or both) and reduced motif redundancy 
by removing highly similar motifs as in ref. 13 and references therein. We 
sorted the motifs in a descending order by difference in log2(hkCP enrichment) − 
log2(dCP enrichment). When assessing whether the observed motif distribution 
persisted for distal enhancers (Extended Data Fig. 10a), we kept the motifs and 
their order as in Fig. 5a and only re-evaluated their enrichment in distal enhancers. 
The colour intensity of the heat maps represents log2(confidence ratio of motif 
counts in peak windows/motif counts in random control regions). We used previ- 
ously published PWMs or created PWMs from published nucleotide counts for 
TATA box, Inr, MTE, DPE and Ohler motifs!* 1, 5, 6, 7 and the TCT motif* restricted 
to 8 bp. We scanned for motif occurrences using MAST from the MEME suite’® 
(version 4.9.0) and parameters that ensured specificity and sensitivity for each motif 
(Supplementary Table 21). For enhancer-to-gene assignment methods 1 and 2 
described earlier, we determined the presence of each core promoter element in the 
core promoter region of all genes uniquely assigned to either hkCP or dCP enhan- 
cers, respectively. For assignment method 3, we took the core promoter elements of 
the TSSs of the longest messenger RNA isoform. We assessed the differential 
distribution of each core promoter element between the core promoters assigned 
to hkCP or dCP enhancers by confidence ratios and hypergeometric P values. 
Transcription factor motif and core promoter element de novo discovery. We 
used MEME® (version 4.9.0) to discover de novo motifs with lengths between 5 
and 8 nucleotides in the enhancer regions we identified using STARR-seq and in 
the core promoter regions around the nearest annotated TSS. We provide 
all discovered motifs in Supplementary Table 22. 

Core promoter similarity heat map. For all pairs of core promoters, we computed 
pair-wise PCCs between the respective STARR-seq fragment coverages at the sum- 
mits of all peaks called in either of the two screens genome wide. We performed 
hierarchical clustering (complete linkage) in R, directly using the computed PCC 
values as similarities. 
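
The similarity computation and clustering can be sketched in plain Python (the analysis itself was done in R). Complete linkage is expressed here directly on similarities: the similarity of two clusters is the minimum pairwise PCC between their members, and the most similar pair of clusters is merged first. Function names and the input layout are hypothetical.

```python
from math import sqrt

def pcc(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def complete_linkage(names, sim):
    """Agglomerative clustering on similarities.
    sim maps frozenset({a, b}) to the PCC of a and b.
    Returns the merges as (cluster_1, cluster_2, similarity)."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: least similar pair defines the link.
                s = min(sim[frozenset((a, b))]
                        for a in clusters[i] for b in clusters[j])
                if best is None or s > best[0]:
                    best = (s, i, j)
        s, i, j = best
        merges.append((clusters[i], clusters[j], s))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges
```

The last merge in the returned list corresponds to the top split of the dendrogram, which in Fig. 3b separates housekeeping from regulated core promoters.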

STARR-seq enrichment heat map. We computed the log2 of the corrected STARR- 
seq enrichment over input as described earlier, but for each nucleotide in a 20 kb 
window around all reference peak summit positions, and down-sampled the data 
points 50-fold by calculating one average data point per 50 nucleotides. 
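
The 50-fold down-sampling amounts to averaging consecutive bins of 50 nucleotides, so a 20 kb window yields 400 data points. A minimal sketch, with a hypothetical function name:

```python
def downsample(values, bin_size=50):
    """Average consecutive, non-overlapping bins of per-nucleotide values.
    A final shorter bin is averaged over its actual length."""
    bins = (values[i:i + bin_size] for i in range(0, len(values), bin_size))
    return [sum(b) / len(b) for b in bins]
```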
STARR-seq enrichment meta-profiles around TSSs. We calculated corrected 
STARR-seq enrichments (log2) as for the heat maps, but for 20 kb windows around 
TSSs, selected according to their core promoter motif content (see Extended Data 
Figs 4 and 8), corrected for the orientation of the TSSs within the genomic se- 
quence. We then calculated the average for each position along the x-axis. 
Boxplot. We obtained Dref ChIP-seq and input data (from Kc167 cells) from ref. 21 
(Gene Expression Omnibus accession numbers GSM977024 and GSM762849) 
and mapped the 36-nucleotide reads using bowtie’ (version 0.12.9) with the fol- 
lowing parameters: -p 4 -q -v 3 -m 1 --best --strata --quiet. We extended the reads 
to 150 bp, calculated the coverage for ChIP-seq and input at the STARR-seq peak 
summit, normalized the value to the number of input fragments, added a pseudo 
count of 1, and computed the confidence ratio of ChIP-seq over input. For the Trl 
ChIP-chip data obtained from ref. 22, we used the signal of the chip-array probe at 
the peak summit if available or inferred the signal by linear extrapolation from the 
two nearest flanking probes (one on each side) provided that they were both within 
10 nucleotides of the peak summit. We calculated statistical significance via 
Wilcoxon’s paired rank tests. 
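
The way a ChIP-chip signal is obtained at a peak summit can be sketched as follows. The inference from the two flanking probes is read here as a linear interpolation between them, which is an interpretation of the description above; the function name and the (position, signal) input format are hypothetical.

```python
from bisect import bisect_left

def signal_at_summit(summit, probes, max_dist=10):
    """probes: list of (position, signal) sorted by position.
    Returns the probe signal at the summit if a probe sits there,
    otherwise a linear interpolation between the two flanking probes
    if both lie within max_dist nucleotides, otherwise None."""
    positions = [pos for pos, _ in probes]
    i = bisect_left(positions, summit)
    if i < len(probes) and positions[i] == summit:
        return probes[i][1]
    if i == 0 or i == len(probes):
        return None  # no probe on one side of the summit
    (x0, y0), (x1, y1) = probes[i - 1], probes[i]
    if summit - x0 > max_dist or x1 - summit > max_dist:
        return None
    return y0 + (y1 - y0) * (summit - x0) / (x1 - x0)
```

Summits for which the function returns None have no usable signal and would be left out of the comparison.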

Coordinate intersections. We performed genomic coordinate intersections using 
the BEDTools suite** (version 2.17.0). 

Statistics. We performed all statistical calculations and created graphical displays 
with R”. 


31. Saito, K. et al. A regulatory circuit for piwi by the large Maf gene traffic jam in 
Drosophila. Nature 461, 1296–1299 (2009). 

32. Arnold, C. D. et al. Quantitative genome-wide enhancer activity maps for five 
Drosophila species show functional enhancer conservation and turnover during 
cis-regulatory evolution. Nature Genet. 46, 685–692 (2014). 

33. Ashburner, M. et al. Gene ontology: tool for the unification of biology. Nature Genet. 
25, 25–29 (2000). 

34. Tomancak, P. et al. Global analysis of patterns of gene expression during Drosophila 
embryogenesis. Genome Biol. 8, R145 (2007). 

35. Chintapalli, V. R., Wang, J. & Dow, J. A. T. Using FlyAtlas to identify better Drosophila 
melanogaster models of human disease. Nature Genet. 39, 715–720 (2007). 

36. Bailey, T. L. & Gribskov, M. Combining evidence using p-values: application to 
sequence homology searches. Bioinformatics 14, 48–54 (1998). 

37. Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory-efficient 
alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 
(2009). 

38. Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing 
genomic features. Bioinformatics 26, 841–842 (2010). 

39. R Development Core Team. R: A Language and Environment for Statistical 
Computing (R Foundation for Statistical Computing, 2010). 

40. Zeitlinger, J. & Stark, A. Developmental gene regulation in the era of genomics. Dev. 
Biol. 339, 230–239 (2010). 

41. Kvon, E. Z. et al. Genome-scale functional characterization of Drosophila 
developmental enhancers in vivo. Nature 512, 91–95 (2014). 

42. Soler, E. et al. The genome-wide dynamics of the binding of Ldb1 complexes 
during erythroid differentiation. Genes Dev. 24, 277–289 (2010). 

43. Chen, K. et al. A global change in RNA polymerase II pausing during the Drosophila 
midblastula transition. eLife 2, e00861 (2013). 

44. Lagha, M. et al. Paused Pol II coordinates tissue morphogenesis in the Drosophila 
embryo. Cell 153, 976–987 (2013). 

45. Kwak, H., Fuda, N. J., Core, L. J. & Lis, J. T. Precise maps of RNA polymerase reveal 
how promoters direct initiation and pausing. Science 339, 950–953 (2013). 


©2015 Macmillan Publishers Limited. All rights reserved 


[Extended Data Figure 1 graphic. a, Schematic: candidate fragment is an enhancer 
(detected) versus candidate fragment is a core promoter (not detected; no PCR 
product due to missing primer binding site). c, Scatter plots of Enr. rep 1 versus 
Enr. rep 2 (log2): hkCP S2 STARR-seq, PCC = 0.98; dCP S2 STARR-seq, PCC = 0.85.] 

Extended Data Figure 1 | Set-up of STARR-seq with different core 
promoters. a, STARR-seq detects enhancers but no promoters (reproduced 
with permission from ref. 12). Left, STARR-seq couples the enhancer activities 
of candidate fragments to the sequences of the candidates in cis by placing 
the candidates to a position within the reporter transcript. Enhancer activities 
can therefore be assessed by the presence of candidates among cellular 
messenger RNAs, which allows the parallel assessment of millions of 
candidates, enabling genome-wide screens. Sequences that activate 
transcription from the intended core promoter of the STARR-seq vector lead to 
a full-length reporter transcript and can be detected by STARR-seq. Shown are 
the reverse transcription (RT) and nested polymerase chain reaction (PCR) 
steps of the STARR-seq reporter RNA processing protocol that ensure this. 
Right, in contrast, STARR-seq does not detect truncated transcripts that result 
if a candidate fragment functions as a promoter to initiate transcription. 
Thus, core-promoter-containing (that is, TSS-overlapping) sequences that are 
detected by STARR-seq exhibit enhancer activity as they can activate 
transcription from a remote position, in addition to their ability to serve as core 


promoters endogenously’. b, Luciferase signals (firefly/Renilla) assessing the 
intrinsic (or basal) activity of the core promoters used in this study. The 
luciferase reporter constructs do not contain any enhancer and differ only in 
the respective core promoter sequences. The basal activities differ as expected, 
but do not differ consistently between housekeeping (RpS12, eEF1δ, NipB,
x16) and developmental (DSCP, eve (long), eve and pnr) core promoters, nor
between core promoters for which the STARR-seq screens appear most similar
(for example, RpS12 and eEF1δ; see Fig. 3). Note that all luciferase assays
and STARR-seq screens are corrected for differences in intrinsic activity.

c, Reproducibility of hkCP and dCP STARR-seq in D. melanogaster S2 cells. 
The reproducibility of hkCP and dCP STARR-seq as assessed by the STARR- 
seq enrichments (replicate 1 versus 2) at the summits of enhancer peaks 
called in the merged experiments (hkCP: 5,956; dCP: 5,408). Scatter plots
are enlarged versions of the insets in Fig. 1d. “Enr. rep X”, STARR-seq
enrichment in replicate X. Note that the raw data for dCP have been re-analysed 
from ref. 12. 
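
The replicate agreement reported in panel c is a Pearson correlation coefficient (PCC) between per-peak enrichments. A minimal sketch of that calculation follows; the function name and the toy enrichment values are illustrative assumptions, not code or data from the study.

```python
import math

def pearson_cc(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("PCC is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical log2 enrichments at four peak summits in two replicates.
rep1 = [1.0, 2.5, 3.1, 4.8]
rep2 = [1.2, 2.4, 3.5, 4.6]
print(round(pearson_cc(rep1, rep2), 3))
```

In practice the two input lists would hold the replicate 1 and replicate 2 enrichments at the same peak summits, in the same order.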

[Extended Data Figure 2 image. Panel titles: a, Schematic overview of 5′ RACE
on the STARR-seq vectors, with the RT, PCR and sequencing primers (sequencing
primer: GFP-seq-rv); b, Agarose gel electrophoresis of 5′ RACE nested PCRs
(PCR 2); c, Sequence alignment and chromatogram of Sanger sequencing of the
PCR product (nested PCR) of 5′ RACE of the hkCP STARR-seq vector; d, the same
for the dCP STARR-seq vector. Core-promoter-enhancer pairs used: hkCP (RpS12
core promoter) and hkCP_19 (expected size of nested PCR product: ~350 bp);
dCP (DSCP) and intronic enhancer of zfh1 (expected size of nested PCR
product: ~400 bp).]

Extended Data Figure 2 | Transcription initiates within the core promoter
of the STARR-seq construct. a-d, 5′ Rapid amplification of cDNA ends
(5′ RACE) demonstrates that transcription initiates at the TCT and Inr motifs
within the hkCP and dCP, respectively. a, Set-up of the 5′ RACE experiment,
including the STARR-seq plasmid, used here with two defined enhancers, the
STARR-seq transcript and the location of all primers used to specifically
amplify 5′-capped STARR-seq transcripts. b, 5′ RACE nested PCR products
separated on a 1% agarose gel. c, Screenshot of Sanger sequencing results
(chromatogram and called bases) compared with the template sequence.
Annotations are shown in green, in the following order: 5′ RACE adaptor,
hkCP with TCT motif (only the part downstream of the TSS is annotated,
as the 5′ part is not present in the sequenced complementary DNA), spliced
intron, green fluorescent protein (GFP); the sequencing primer is shown in
red (top). Also shown is a version that displays the template and Sanger
sequencing results for the core promoter region only (zoom in). d, Same as in
c but for the dCP for which transcription initiates within the Inr motif.


Extended Data Figure 3 | Specificity of hkCP and dCP enhancers to the 
hkCP and dCP assessed by luciferase assays. a, Luciferase reporter set-up with 
the hkCP or dCP (see also Fig. 1e). b, Luciferase signals of 24 hkCP-specific
enhancers tested in a hkCP- (purple bars) as well as in a dCP-containing
(brown bars) luciferase reporter. Twenty-one out of 24 hkCP enhancers
showed luciferase activity (>1.5-fold over negative, P < 0.05 via one-sided
unpaired Student’s t-test, n = 3) with the hkCP, while only 1 out of 24 showed
activity with the dCP (error bars are s.d. of three biological replicates; ‘x’
indicates candidates that are not active with the correct core promoter, and ‘+’
indicates candidates for which the activity with the wrong core promoter is
above the threshold; note that the activity with the correct core promoter is
still higher in all three cases). c, As in b but testing dCP-specific enhancers. Ten
out of 12 are positive with the dCP whereas only 2 out of 12 are positive with 
the hkCP. d, As in b and c but testing shared enhancers that were found by 
STARR-seq with hkCP and dCP; 6 out of 7 are active with both core promoters. 
See Supplementary Table 17 for the genomic coordinates of the enhancers 
and the primers used to amplify them. 
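
The activity call used in this caption (more than 1.5-fold over the negative control and P < 0.05 in a one-sided unpaired Student’s t-test, n = 3) can be sketched as below. This is an illustration under stated assumptions, not the authors' analysis code: the function names are hypothetical, "fold over negative" is taken to be the ratio of replicate means, and the t-test is the equal-variance (pooled) form with the tail probability obtained by numerical integration.

```python
import math
from statistics import mean, variance

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for Student's t with df degrees of freedom (Simpson's rule)."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(x):
        return norm * (1 + x * x / df) ** (-(df + 1) / 2)

    b = abs(t)
    if b == 0:
        return 0.5
    h = b / steps  # steps is even, as Simpson's rule requires
    total = pdf(0.0) + pdf(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # P(0 < T < |t|)
    tail = 0.5 - area if t > 0 else 0.5 + area
    return min(1.0, max(0.0, tail))

def is_active(candidate, negative, min_fold=1.5, alpha=0.05):
    """Assumed rule: fold change above min_fold and one-sided P below alpha."""
    n1, n2 = len(candidate), len(negative)
    diff = mean(candidate) - mean(negative)
    pooled = ((n1 - 1) * variance(candidate)
              + (n2 - 1) * variance(negative)) / (n1 + n2 - 2)
    se = math.sqrt(pooled * (1 / n1 + 1 / n2))
    if se == 0:
        p = 0.0 if diff > 0 else 1.0
    else:
        p = t_upper_tail(diff / se, n1 + n2 - 2)
    fold = mean(candidate) / mean(negative)
    return fold > min_fold and p < alpha

# Hypothetical normalized luciferase signals, three biological replicates each.
print(is_active([3.0, 3.2, 2.9], [1.0, 1.1, 0.9]))
print(is_active([1.1, 1.0, 1.2], [1.0, 1.1, 0.9]))
```

Requiring both criteria guards against calling a candidate active on a large but noisy fold change, or on a reproducible but negligible one.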

Extended Data Figure 4 | hkCP and dCP STARR-seq signal in S2 cells
around different core promoter types. Average hkCP (top) and dCP (bottom)
S2 STARR-seq enrichment in 40 kb intervals around TSSs that contain different
combinations of known core promoter motifs. Shown are (left to right) 
TATA box-Inr (179 TSSs), Inr (that do not contain either TATA box or DPE; 
1,901), Inr-DPE (100), TCT (303) and motif 1-motif 6 (266). According to
their motif contents, the first three are developmental-type core promoters
and the last two are housekeeping-type core promoters. Indeed, only the 
housekeeping-type core promoters show a strong enrichment of hkCP S2 
STARR-seq signals at the TSS, which is not seen for the dCP STARR-seq signal 
(owing to enhancer-core-promoter specificity) nor for the developmental-type 
core promoters (owing to the location of dCP enhancers at more distal sites).
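
An average signal profile around a set of TSSs, as plotted in this figure, can be computed by stacking fixed-width windows and averaging position by position. The sketch below is a simplified illustration: the function name, the binned `signal` list and the toy values are assumptions, and strand orientation is ignored.

```python
def average_profile(signal, tss_positions, flank):
    """Average signal in windows of +/- flank bins centred on each TSS."""
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for tss in tss_positions:
        start, end = tss - flank, tss + flank + 1
        if start < 0 or end > len(signal):
            continue  # skip windows that run past the ends of the signal
        for i, value in enumerate(signal[start:end]):
            totals[i] += value
        used += 1
    if used == 0:
        raise ValueError("no TSS window fits inside the signal")
    return [t / used for t in totals]

# Hypothetical binned enrichment values and two TSS bin indices.
signal = [0, 1, 2, 3, 4, 5, 6]
print(average_profile(signal, [2, 4], flank=1))
```

For a 40 kb interval, `flank` would correspond to 20 kb expressed in bins of whatever resolution the signal uses.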

Extended Data Figure 5 | TSS-overlapping hkCP enhancers function
independent of their orientation. Luciferase signals for all 17 TSS- 
overlapping hkCP enhancers (that is, containing one TSS or two divergent 
TSSs; see Supplementary Table 17) from Extended Data Fig. 3 cloned in the 
second orientation with respect to the TSS of the luciferase gene (bottom bar 
plot; the top bar plot corresponds to the initial orientation as in Extended 
Data Fig. 3 and is shown for comparison). In both orientations, 15 out of 17 
enhancers showed activity towards the hkCP (details as in Extended Data 
Fig. 3). These results together with the findings in Extended Data Fig. 3 
challenge the widespread notion that TSS-proximal sequences are promoters 
and even the concept of promoters more generally: sequences that
autonomously activate gene expression, and are therefore often termed
promoters, might in fact be the combination of a core promoter and a
proximal enhancer. The TSS-proximal location of many housekeeping 
enhancers might be evolutionarily more ancient, consistent with regulatory 
mechanisms in simple eukaryotes such as yeast. In contrast, enhancers of 
genes with more complex regulation are typically located more distally, 
potentially simply because the several different cell-type-specific enhancers of 
these genes would not all fit to positions near TSSs. Consistently, such genes 
frequently have larger intergenic and intragenic regions known to
accommodate enhancers with diverse activity patterns.

Extended Data Figure 6 | hkCP and dCP enhancers in S2 cells are associated
with genes of different functions and core promoter elements. a, GO analysis 
of genes next to hkCP- and dCP-specific enhancers in S2 cells using different 
enhancer-to-gene assignment strategies (top left, ‘closest TSS’ as in Fig. 2;
top right, ‘1 kb TSS’; bottom left, ‘gene loci’; see Methods for details). Shown are
20 non-redundant GO categories selected from the 100 most significantly
enriched categories associated with each enhancer class (see Supplementary
Tables 2-4 for all categories). b, Enrichment of core promoter elements at genes
next to hkCP- and dCP-specific enhancers in S2 cells. Similar analysis as in 
Fig. 2e, but using different enhancer-to-gene assignment strategies (see 
Methods for details). Consistent with Fig. 2e, core promoters of genes assigned 
to hkCP-specific enhancers are enriched in motifs 1, 5, 6, 7 and DRE, while 
core promoters of genes assigned to dCP-specific enhancers are enriched for 
TATA box, Inr, MTE and DPE motifs, irrespective of the assignment strategy. 

Extended Data Figure 7 | Housekeeping and developmental core promoters
differ characteristically in their global enhancer preferences. As in Fig. 3b
but including biological replicates with independently cloned focused bacterial
artificial chromosome (BAC) libraries covering around 5 Mb of genomic
sequence (BAC) and assessing the PCC at each position along these regions.
GW, genome-wide screens as in Fig. 3b. The similarity observed for the TATA
box- and DPE-containing core promoters (Hsp70, pnr and DSCP (dCP))
suggests that differences related to these core promoter elements might be
more subtle or related to alternative mechanisms, including the potential
preferences of more proximal or distal enhancers or RNA polymerase II
pausing and the dynamics versus stochasticity of initiation and
elongation.


©2015 Macmillan Publishers Limited. All rights reserved 


[Extended Data Figure 8 image. Panels a, b: hkCP OSC STARR-seq and dCP OSC
STARR-seq enrichment (log2), with fold-change classes (<2x hkCP and dCP;
2x-4x hkCP; 2x-4x dCP; >4x hkCP; >4x dCP) and replicate scatter plots
(Enr. rep1 (log2)). Panel c: genomic distribution categories (core promoter,
proximal promoter, 5′ UTR, CDS + 3′ UTR, intron, intergenic). Panel d: hkCP
OSC and dCP OSC STARR-seq signal around core promoter types (TATA box-Inr,
Inr, Inr-DPE, TCT, motif 1-motif 6), plotted as enrichment (log2) against kb
from TSS (-20 to 20).]


Extended Data Figure 8 | hkCP and dCP enhancers differ in OSCs. 
a, b, Different enhancers activate transcription from hkCP and dCP in 
OSCs. As Fig. 1c, d but for OSCs rather than S2 cells (data in bottom inset of
b are re-analysed from ref. 12). c, Genomic distribution of hkCP and dCP
enhancers in OSCs. As Fig. 2a but for OSCs rather than S2 cells. d, hkCP and
dCP STARR-seq signal in OSCs around different core promoter types. As 
Extended Data Fig. 4 but for OSCs rather than S2 cells. 

Extended Data Figure 9 | Differences between hkCP and dCP enhancers in
OSCs. a, GO analysis of genes next to hkCP- and dCP-specific enhancers
in OSCs. As Extended Data Fig. 6a but for OSCs rather than S2 cells (see
Supplementary Tables 8-10 for all categories). b, Enrichment of core promoter 
elements at genes next to hkCP- and dCP-specific enhancers in OSCs. As Fig. 2e
and Extended Data Fig. 6b but for OSCs rather than S2 cells. NS, not significant
(hypergeometric P > 0.05). c, Heat maps of hkCP (top) and dCP (bottom) 
STARR-seq enrichments in S2 cells and OSCs. Heat maps on the left and 
right are centred on the summits of core-promoter-type-specific enhancers in 
S2 and OSCs, respectively. 
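
The enrichment P values referred to in this caption come from a hypergeometric test. A minimal sketch of the upper-tail probability is given below; the function and argument names are illustrative assumptions, and the example counts are invented.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when drawing `draws` items without replacement from a
    population of size `population` containing `successes` marked items."""
    total = comb(population, draws)
    upper = min(successes, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return hits / total

# Hypothetical example: 1,000 genes, 100 carry a motif, 50 genes are assigned
# to an enhancer class and 12 of those carry the motif.
p_value = hypergeom_upper_tail(12, 1000, 100, 50)
print(p_value < 0.05)
```

A result above the chosen threshold would be reported as NS, as in the figure.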

Extended Data Figure 10 | The activities of hkCP and dCP enhancers are
dependent on DRE and GAGA motifs, respectively. a, Differential motif 
enrichment in distally located hkCP- and dCP-specific enhancers (as in Fig. 5a 
but assessing enrichments of the same motif PWMs exclusively at distal 
enhancers >500 bp away from the closest TSSs). Key motifs including DRE and 
GAGA are also differentially enriched in distal hkCP- and dCP-specific 
enhancers. NS, not significant (FDR-corrected hypergeometric P > 0.01). S2 
cells: hkCP n = 790, dCP n = 3,013; OSCs: hkCP n = 556, dCP n = 2,555.
b, Distal hkCP- and dCP-specific enhancers are differentially bound by Dref
and Trl, respectively. ChIP enrichments of Dref (left) and Trl (right) at S2
hkCP- and dCP-specific enhancers that are distal (>500 bp) from the closest 
TSSs. Equivalent to Fig. 5b, but considering exclusively TSS-distal enhancers to 

N⁶-methyladenosine-dependent RNA structural
switches regulate RNA-protein interactions


Nian Liu, Qing Dai, Guanqun Zheng, Chuan He, Marc Parisien & Tao Pan


RNA-binding proteins control many aspects of cellular biology
through binding single-stranded RNA binding motifs (RBMs).
However, RBMs can be buried within their local RNA structures,
thus inhibiting RNA-protein interactions. N⁶-methyladenosine (m⁶A),
the most abundant and dynamic internal modification in eukaryotic
messenger RNA, can be selectively recognized by the YTHDF2
protein to affect the stability of cytoplasmic mRNAs, but how m⁶A
achieves its wide-ranging physiological role needs further exploration.
Here we show in human cells that m⁶A controls the RNA-structure-dependent
accessibility of RBMs to affect RNA-protein interactions
for biological regulation; we term this mechanism ‘the m⁶A-switch’.
We found that m⁶A alters the local structure in mRNA and long
non-coding RNA (lncRNA) to facilitate binding of heterogeneous
nuclear ribonucleoprotein C (HNRNPC), an abundant nuclear RNA-binding
protein responsible for pre-mRNA processing. Combining
photoactivatable-ribonucleoside-enhanced crosslinking and immunoprecipitation
(PAR-CLIP) and anti-m⁶A immunoprecipitation
(MeRIP) approaches enabled us to identify 39,060 m⁶A-switches
among HNRNPC-binding sites; and global m⁶A reduction decreased
HNRNPC binding at 2,798 high-confidence m⁶A-switches. We determined
that these m⁶A-switch-regulated HNRNPC-binding activities
affect the abundance as well as alternative splicing of target mRNAs,
demonstrating the regulatory role of m⁶A-switches on gene expression
and RNA maturation. Our results illustrate how RNA-binding
proteins gain regulated access to their RBMs through m⁶A-dependent
RNA structural remodelling, and provide a new direction for investigating
RNA-modification-coded cellular biology.
Post-transcriptional m⁶A RNA modification is indispensable for cell
viability and development, yet its functional mechanisms are still poorly
understood. We recently identified one m⁶A site in a hairpin-stem
on the human lncRNA metastasis-associated lung adenocarcinoma
transcript (MALAT1) (Extended Data Fig. 1a). A native gel-shift assay
indicated that this m⁶A residue increases the interaction of the RNA
hairpin with proteins in the HeLa nuclear extract (Fig. 1a). RNA pull-down
assays identified HNRNPC as the protein component of the nuclear
extract that binds more strongly with the m⁶A-modified hairpin than
the unmodified hairpin (Fig. 1b and Extended Data Fig. 1b, c). The
m⁶A-enhanced interaction with the hairpins was validated qualitatively
by ultraviolet crosslinking and quantitatively (~8-fold increase) by filter
binding using recombinant HNRNPC1 protein (Fig. 1c and Extended
Data Fig. 1d).

Figure 1 | m⁶A alters RNA structure to enhance

The HNRNPC protein belongs to the large family of ubiquitously
expressed heterogeneous nuclear ribonucleoproteins that bind nascent
RNA transcripts to affect pre-mRNA stability, splicing, export and
translation. HNRNPC preferably binds single-stranded U-tracts
(five or more contiguous uridines). In the MALAT1 hairpin,
HNRNPC binds a U5-tract that is half buried in the hairpin-stem opposing
the 2,577-A/m⁶A site (Extended Data Fig. 1a, e).

Since m⁶A residues within RNA stems can reduce the thermostability
of model RNA duplexes, we hypothesized that the 2,577-m⁶A
residue destabilizes this MALAT1 hairpin-stem to make its opposing
U-tract more single-stranded or accessible, thus enhancing its interaction
with HNRNPC. We performed several experiments to validate
this hypothesis. First, according to the RNA structural probing assays,
the m⁶A-modified hairpin showed significantly increased nuclease S1
digestion (single-strand specific) at the GAC (A = m⁶A) motif, as well
as markedly decreased RNase V1 digestion (double-strand/stacking specific)
at the U-tract opposing the GAC motif (Fig. 1d). The m⁶A residue
markedly destabilized the stacking properties of the region centred around
the U residue that pairs with 2,577-A/m⁶A (Extended Data Fig. 1f, g),
which was also supported by the increased reactivity between CMCT
and the U-tract bases in the presence of m⁶A (Extended Data Fig. 1h).
Second, the 2,577 A-to-U mutation increased the HNRNPC pull-down
amount from the nuclear extract, whereas U-to-C mutations in the
U-tract significantly reduced the HNRNPC pull-down amount regardless
of m⁶A modification (Fig. 1e). Third, the 2,577 A-to-U mutation
increased the accessibility of the U-tract and enhanced HNRNPC binding
by ~4-fold (Extended Data Fig. 2a-c). Binding results with four
other mutated A/m⁶A oligonucleotides also supported the conclusion that
increased U-tract accessibility alone is sufficient to enhance HNRNPC
binding (Extended Data Fig. 2d). Fourth, RNA terminal truncation followed
by HNRNPC binding identified two pairs of truncated hairpins with
highly accessible U-tracts, which improved HNRNPC binding significantly
but independently of the m⁶A modification (Extended Data
Fig. 2e-i). All these results confirmed that m⁶A modification can alter
its local RNA structure and enhance the accessibility of its base-paired
residues or nearby regions to modulate protein binding (Fig. 1f). We
term this mechanism that regulates RNA-protein interactions through
m⁶A-dependent RNA structural remodelling ‘the m⁶A-switch’.

Figure 2 | PAR-CLIP-MeRIP identifies m⁶A-switches transcriptome
wide. a, CLIP-2dTLC showing the m⁶A enrichment in HNRNPC-bound
RNA regions. Data are mean ± s.d.; n = 3, biological replicates. IP,
immunoprecipitation. b, HNRNPC-bound RNA regions had higher anti-m⁶A
pull-down yield than polyA⁺ RNA. Data are mean ± s.d.; n = 3, biological
replicates. c, Illustration of the PAR-CLIP-MeRIP protocol. UV, ultraviolet.

We performed two experiments to determine the global effect of
m⁶A-switches on HNRNPC binding. First, in vivo crosslinking followed
by immunoprecipitation and two-dimensional thin-layer chromatography
(CLIP-2dTLC) showed that the HNRNPC-bound RNA regions had a
~6-fold higher m⁶A/A ratio than the HNRNPC-bound intact RNA,
and a ~3-fold higher ratio than the flow-through RNA
(Fig. 2a and Extended Data Fig. 3a). Second, the HNRNPC-bound RNA
regions had much higher anti-m⁶A pull-down yield (4.3%) than the
polyA⁺ RNA samples (0.5%) using the previously established m⁶A
antibody (Fig. 2b). These results indicate a widespread presence of
m⁶A residues in the vicinity of HNRNPC-binding sites.

To map the m⁶A sites around HNRNPC-binding sites, we performed
PAR-CLIP to isolate all HNRNPC-bound RNA regions (input control
sample) followed by MeRIP to enrich m⁶A-containing HNRNPC-bound
RNA regions (IP sample). Both the input control and IP samples
from two biological replicates were sent for RNA sequencing (RNA-seq)
(Fig. 2c and Extended Data Fig. 3b, c). This approach, termed
PAR-CLIP-MeRIP, identified m⁶A-proximal HNRNPC-binding sites
transcriptome-wide, such as the enriched peak around the MALAT1 2,577 site
(Fig. 2d). Remarkably, HNRNPC PAR-CLIP-MeRIP peaks harboured
two consensus motifs, the HNRNPC RBM (U-tracts) and the m⁶A consensus
motif GRACH (a subset of RRACH) (Fig. 2e). Both motifs
were located mostly within 50 residues, suggesting transcriptome-wide
RRACH-U-tract coupling events within the HNRNPC-binding sites
(Extended Data Fig. 4a, b). About 62% of all RRACH-U-tract coupling
events within HNRNPC-binding sites are enriched at the RRACH motif
(Fig. 2f). Our PAR-CLIP-MeRIP approach identified a total of 39,060
HNRNPC m⁶A-switches that corresponded to m⁶A-modified RRACH-U-tract
coupling events at a false discovery rate (FDR) = 5% (Extended
Data Fig. 4c). These switches account for ~7% of 592,477 HNRNPC-binding
sites identified by PAR-CLIP. The majority (87%) of m⁶A-switches
occur within introns (Extended Data Fig. 4d, e), consistent
with reports that HNRNPC is nuclear localized and primarily
binds nascent transcripts. We validated two intronic m⁶A-switches
in hairpin structures in which m⁶A residues increase the U-tract accessibility
and enhance HNRNPC binding by ~3-4-fold (Fig. 2g, h and
Extended Data Fig. 5).
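
The notion of an RRACH-U-tract coupling event can be illustrated with a simple sequence scan. The sketch below is an assumption-laden illustration, not the pipeline used in the study: it takes R = A/G and H = A/C/U, defines a U-tract as five or more contiguous uridines (as stated earlier in the text), and measures distance from the central A of RRACH to the nearest end of the U-tract, which is one possible reading of "within 50 residues". The test sequence is a toy hairpin-like string modelled on the Methods oligonucleotides.

```python
import re

# Lookahead so that overlapping RRACH matches are all reported.
RRACH = re.compile(r"(?=([AG][AG]AC[ACU]))")
U_TRACT = re.compile(r"U{5,}")

def find_coupling_events(rna, max_distance=50):
    """Return (motif, A position, U-tract span, distance) for each RRACH
    motif whose central A lies within max_distance of a U-tract."""
    rna = rna.upper().replace("T", "U")
    tracts = [(m.start(), m.end() - 1) for m in U_TRACT.finditer(rna)]
    events = []
    for match in RRACH.finditer(rna):
        a_pos = match.start() + 2  # the potentially methylated adenosine
        for start, end in tracts:
            distance = start - a_pos if start > a_pos else a_pos - end
            if distance <= max_distance:
                events.append((match.group(1), a_pos, (start, end), distance))
    return events

toy_hairpin = "AACUUAAUGUUUUUGCAUUGGACUUUGAGUUA"  # illustrative only
for event in find_coupling_events(toy_hairpin):
    print(event)
```

Positions are 0-based indices into the input string. A sequence with the U-tract disrupted (for example UUUUU changed to UCCUU) yields no events under this definition.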

Figure 3 | Global m⁶A reduction decreases HNRNPC binding at
m⁶A-switches. a, Density plot showing negative enrichment at the U-tracts.
KD, knockdown. b, Identification of high-confidence (HCS) m⁶A-switches.
c, Regional distribution of high-confidence m⁶A-switches. CDS, coding
sequence. d, Density plot showing m⁶A-switch distribution relative to
exon/intron boundaries. e, m⁶A-switches in coding RNA were enriched in the
3′ UTR and near the stop codon. f, Cumulative distribution of HCS
m⁶A-switches (black) and control (orange) regarding the S1/V1 cleavage
preference (data from ref. 4) at U-tracts and RRACH motif. U-tract can be
3′ (top) or 5′ (bottom) of the RRACH motif. *P < 0.05, **P<10 +,
Kolmogorov-Smirnov test. g, Phylogenetic conservation of high-confidence
m⁶A-switches among primates and vertebrates. ***P< 10 !°,
Mann-Whitney-Wilcoxon test.

To assess the effect of global m⁶A reduction on RNA-HNRNPC interactions,
we performed HNRNPC PAR-CLIP experiments in METTL3
and METTL14 knockdown cells (Extended Data Fig. 6a). We identified
16,582 coupling events with decreased U-tract-HNRNPC interactions
upon METTL3 and METTL14 knockdown (METTL3/L14 knockdown)
(Fig. 3a and Extended Data Fig. 6b, c). In total, 2,798 m⁶A-switches identified
by PAR-CLIP-MeRIP experiments showed decreased HNRNPC
binding upon METTL3/L14 knockdown (Fig. 3b) and this number is
probably an underestimate because METTL3/L14 knockdown
reduces the global m⁶A level by only ~30-40% (refs 11, 12). These
sites composed the high-confidence m⁶A-switches that were used for
subsequent analysis.
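
The proportions implied by the reported counts can be checked with a few lines of arithmetic. The variable names below are illustrative; the counts are those quoted in the text.

```python
def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole

hnrnpc_sites = 592_477    # HNRNPC-binding sites identified by PAR-CLIP
m6a_switches = 39_060     # m6A-switches identified by PAR-CLIP-MeRIP
high_confidence = 2_798   # switches with decreased binding upon knockdown

print(f"{percent(m6a_switches, hnrnpc_sites):.1f}% of binding sites are m6A-switches")
print(f"{percent(high_confidence, m6a_switches):.1f}% of m6A-switches are high-confidence")
```

The first ratio is consistent with the ~7% stated in the text; the second shows that the high-confidence set is a small subset of all identified switches, in line with the authors' remark that it is probably an underestimate.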

High-confidence m⁶A-switches are enriched in the introns of coding
and non-coding RNAs (Fig. 3c and Extended Data Fig. 6d). Exonic
m⁶A-switches are enriched at the middle of exons whereas intronic
m⁶A-switches are slightly enriched near the 5′ end (Fig. 3d). m⁶A-switches
within coding RNAs tend to be located in very long exons (Extended Data
Fig. 6e) and are enriched near the stop codon and in the 3′ untranslated
region (UTR) (Fig. 3e), consistent with the known topology of the human
m⁶A methylome in mRNAs. Transcriptome-wide RNA structural
mapping on high-confidence m⁶A-switches yielded structural patterns
consistent with our three demonstrated m⁶A-switch hairpins (Fig. 3f).
The RR residues in the RRACH motif and the 3′ U-tract residues show
increased structural dynamics in the presence of m⁶A. In addition,
m⁶A-switches prefer short RRACH-U-tract inter-motif distances, are not
involved in the previously reported inter-U-tract motif patterns and
are conserved across species (Fig. 3g and Extended Data Fig. 6f-i).

To reveal the function of m⁶A-switches in RNA biology, we performed
polyA⁺ RNA-seq from HNRNPC, METTL3 and METTL14 knockdown
and control cells (Extended Data Fig. 7a). METTL3/L14 knockdown,
which has been shown to decrease HNRNPC binding transcriptome-wide,
co-regulated the expression of 5,251 genes with HNRNPC knockdown.
In comparison, METTL3/L14 knockdown co-regulated only 24
genes with knockdown of another mRNA-binding protein, HNRNPU
(Extended Data Fig. 7b), which was not enriched in our m⁶A-hairpin
pull down (Fig. 1b). Approximately 45% of 1,815 high-confidence
m⁶A-switch-containing genes were co-regulated by HNRNPC and METTL3/L14
knockdown, indicating that m⁶A-switch-regulated HNRNPC binding
affects the abundance of target mRNAs. Gene ontology (GO) analysis
suggests that m⁶A-switch-regulated gene expression may influence ‘cell
proliferation’ and other biological processes (Extended Data Fig. 7c). The
m⁶A-switch-regulated expression of genes within these GO categories
was validated by quantitative polymerase chain reaction (qPCR) (Fig. 4a
and Extended Data Fig. 7d-g). We also found that HNRNPC, METTL3
and METTL14 knockdown decreased the cell proliferation rate to similar
extents (Extended Data Fig. 7h).

Figure 4 | m⁶A-switches regulate mRNA abundance and alternative
splicing. a, HNRNPC, METTL3/L14 knockdown (KD) co-regulated the
abundance of m⁶A-switch-containing transcripts by RNA-seq and qPCR. Ctrl,
control. b, Illustration of the relative exon distance to m⁶A-switches.
c, Co-regulated exons by HNRNPC knockdown and METTL3 knockdown (left) and
METTL14 knockdown (right) were more enriched around m⁶A-switch sites
than non-co-regulated exons, Kolmogorov-Smirnov test. HCS, high

Besides changes in mRNA abundance, we also observed splicing
pattern changes within high-confidence m⁶A-switch-containing transcripts
by testing the differential exon usage in RNA-seq data (DEXSeq).
HNRNPC knockdown co-up/downregulated 131/127 exons with METTL3
knockdown and 130/115 exons with METTL14 knockdown. These co-regulated
exons occur more frequently in the vicinity of m⁶A-switches
than non-co-regulated exons (Fig. 4b, c), indicating that m⁶A-switches
tend to regulate splicing events at nearby exons. We investigated the
splicing pattern at two exons with neighbouring m⁶A-switches: the
PAR-CLIP-MeRIP and METTL3/L14 knockdown data confirmed the
HNRNPC-binding signature at the m⁶A-switch site neighbouring these
exons; and HNRNPC and METTL3/L14 knockdown co-inhibited exon
inclusion in both cases (Fig. 4d-f and Extended Data Fig. 8b-f). In addition,
we identified 155 genes with multiple m⁶A-switches exhibiting more
than two splice variants, and 221 m⁶A-switch-containing genes with
differentially expressed splice variants in HNRNPC and METTL3/L14
knockdown samples. Further analysis suggested that m⁶A-switches have
an effect on intron exclusion (Extended Data Fig. 8g). Consistent with
previous reports about splicing regulation by both HNRNPC and
m⁶A, our results indicate that m⁶A functions as an RNA structure
remodeller to affect mRNA maturation through interference with
post-transcriptional regulator binding activities.
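
The comparison of exon-to-switch distances between co-regulated and non-co-regulated exons relies on a two-sample Kolmogorov-Smirnov test. The sketch below computes only the test statistic D (the largest gap between the two empirical cumulative distributions) and omits the P value; the function name and the example distances are illustrative assumptions.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic D = max |F_a(x) - F_b(x)|."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a = sorted(sample_a)
    b = sorted(sample_b)
    d = 0.0
    # The maximum gap between two step functions occurs at an observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

# Hypothetical exon distances (kb) to the nearest m6A-switch.
co_regulated = [1, 2, 3, 4]
non_co_regulated = [3, 4, 5, 6]
print(ks_statistic(co_regulated, non_co_regulated))
```

A larger D means the two distance distributions differ more; converting D to a P value additionally requires the sample sizes.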

We demonstrated that post-transcriptional m⁶A modifications could
modulate the structure of coding and non-coding RNAs to regulate
RNA-HNRNPC interactions, thus influencing gene expression and
maturation in the nucleus. It is possible that m⁶A could also recruit additional
accessory factors, such as the YTH domain proteins, which can
directly recognize m⁶A, as previously reported, to destabilize the
RNA structure and facilitate HNRNPC binding. Besides HNRNPC,
m⁶A-switches may regulate the function of many other RNA-binding
proteins through modulating the RNA-structure-dependent accessibility
of their RBMs. Our work indicates widespread m⁶A-induced
mRNA and lncRNA structural remodelling that affects RNA-protein
interactions for biological regulation.


Online Content Methods, along with any additional Extended Data display items 
and Source Data, are available in the online version of the paper; references unique 
to these sections appear only in the online paper. 

METHODS

Mammalian cell culture, siRNA knockdown and western blot. Human cervical
cancer cell line HeLa (CCL-2) and embryonic kidney cell line HEK293T
(CRL-11268) were obtained from the American Type Culture Collection (ATCC) and were
cultured under standard conditions. Control short interfering RNA (siRNA) (1027281,
Qiagen), METTL3 siRNA (SI04317096, Qiagen), METTL14 siRNA (SI04317096,
Qiagen) or HNRNPC siRNA (10620318, Invitrogen) were transfected into HEK293T
cells at a concentration of 40 nM using Lipofectamine RNAiMAX (Invitrogen) according
to the manufacturer’s instructions. Cells were collected 48 h after the transfection,
shock-frozen in liquid nitrogen, and stored at −80 °C for further studies. Western
blot analysis using METTL3- (HPA038002, Sigma), METTL14- (HPA038002, Sigma),
HNRNPC- (sc-32308, Santa Cruz) and GAPDH- (A00192-40, Genescript) specific
antibodies was performed under standard procedures. Blotting membranes were
stained by ECL-prime (RPN2232, GE Healthcare) and visualized by a digital imaging
system (G:BOX, SYNGENE). All synthetic oligonucleotides were synthesized
by Q.D.

Gel shift, RNA pull-down and filter-binding assays. HeLa nuclear extracts were
isolated using the NE-PER Nuclear and Cytoplasmic Extraction Reagents (78833,
Thermo Scientific) according to the manufacturer’s instructions. The purified
radioactively labelled RNA oligonucleotides were refolded by heating at 90 °C for 1 min,
then at 30 °C for 5 min. Three microlitres HeLa nuclear extract and 6 µl refolded
RNA were incubated at room temperature for 30 min and then at 4 °C for 2 h. Each
sample was mixed with 1 µl 50% glycerol, separated on an 8% native 1× TBE gel, and
visualized by phosphorimaging using the Personal Molecular Imager (Bio-Rad).

The in vitro pull-down assay was performed as described. The eluted protein
samples were separated on 4-12% polyacrylamide Bis-Tris gels (NP0321BOX,
Invitrogen) and stained with SYPRO-Ruby (S12000, Invitrogen) according to the
manufacturer’s instructions. Protein in gel slices or the entire pulled-down protein
samples were digested with trypsin and identified using liquid chromatography-tandem
mass spectrometry by the Donald Danforth Plant Science Center
(Washington University). The RNA oligonucleotides used in Fig. 1f were: 2,577-U,
5′-AACUUAAUGUUUUUGCAUUGGUCUUUGAGUUA-Biotin; 2,577-CC-A,
5′-AACUUAAUGUCCUUGCAUUGGACUUUGAGUUA-Biotin; 2,577-CC-m⁶A,
5′-AACUUAAUGUCCUUGCAUUGGm⁶ACUUUGAGUUA-Biotin.

The full-length HNRNPC1 protein was purified and the in vitro ultraviolet
crosslinking assay was performed as previously described. Filter-binding assays were
performed as previously described.

CLIP-2dTLC. HEK293T cells at 70-80% confluency were ultraviolet irradiated with
400 mJ cm−2 at 254 nm, and harvested by centrifuging at 4,000 r.p.m. for 3 min at
4 °C (with centrifugation rotor 75003524, Fisher Scientific). The pellet of crosslinked
cells was resuspended in 1 ml lysis buffer (1× PBS, 0.1% SDS, 1% Nonidet
P-40, 0.5% sodium deoxycholate, protease inhibitor cocktail and RNase inhibitor)
and incubated on ice for 4 h. Cell lysate was isolated by centrifuging at 3,000 r.p.m.
for 5 min and pre-blocked with 50 µl protein A beads in 300 µl lysis buffer. Another
50 µl protein A beads (Invitrogen) were incubated with 8 µg of the corresponding
antibodies for 4 h at room temperature, and then mixed with the pre-blocked cell lysate
at 4 °C overnight. The beads were washed three times with 1 ml wash buffer (20 mM
Tris-HCl pH 7.4, 10 mM MgCl2, 0.2% Tween-20), three times with 1 ml high-salt
buffer (5× PBS, 0.1% SDS, 1% Nonidet P-40, 0.5% sodium deoxycholate), and three
times with 1 ml wash buffer. The beads were resuspended in 1 ml wash buffer, and
divided into 2 × 500 µl in two separate tubes. One tube was incubated with 200 µl
RNase T1/A mixture at room temperature for 1 h. The other tube was incubated
with 200 µl nuclease-free water at room temperature for 1 h. The beads were washed
three times with 1 ml high-salt buffer, and three times with 1 ml wash buffer. Crosslinked
RNA was eluted from the beads by incubating with 200 µl RNA elution buffer
(100 mM Tris-HCl pH 7.4, 10 mM EDTA, 1% SDS) containing 2 mg ml−1 proteinase
K at 50 °C for 30 min followed by phenol/chloroform extraction. The RNA pellet
was dissolved in 7 µl nuclease-free water containing 1 µl RNase T1 (200 U), heated at
65 °C for 2 min, and incubated at 37 °C for 30 min. The T1-digested RNA fragments
were labelled by adding 2 µl T4 PNK mix (4.5 U µl−1 T4 PNK, 600 Ci mmol−1
[γ-32P]ATP, 5× PNK buffer) and incubating at 37 °C for 30 min. Unreacted
[γ-32P]ATP was removed using Illustra MicroSpin G-25 columns. The eluted
RNA was digested with 1 µl (1 U µl−1) nuclease P1 at 37 °C for 1 h. Samples were
spotted on a cellulose TLC plate and 2dTLC was run as described using isobutyric
acid:0.5 M NH4OH (5:3, v/v) as the first dimension and isopropanol:HCl:water
(70:15:15, v/v/v) as the second dimension.

RNA structural probing and RNA terminal truncation. The synthetic RNA oligonucleotides
were 5'-end-labelled with γ-32P-ATP by T4 PNK (70031, Affymetrix),
gel purified, and refolded. Structural probing assay with RNase T1, nuclease S1
and RNase V1 was performed as previously described. Note that 3'-end-labelled
HNRNPH1 oligonucleotides were used for the RNA structural probing assay shown
in Fig. 2g.

CMCT RNA structural probing assay was performed as reported. RNA refolding:
3 pmol RNA was annealed in 50 mM potassium borate (pH 8) by heating at
90 °C for 1.5 min followed by incubation at room temperature for 3 min.

RNA terminal truncation assay was carried out as previously reported. RNA
samples were first alkaline-hydrolysed as in the RNA structural probing assay, and
then incubated with HNRNPC1 protein in the same conditions as in the filter-binding
assay. The RNA-protein complexes were then loaded onto filter papers and
washed twice with chilled binding buffer. The filters were air-dried, and RNA samples
were then extracted from the filters and loaded onto a denaturing gel as in the RNA
structural probing assay.
PAR-CLIP and PAR-CLIP-MeRIP. PAR-CLIP procedures were performed as 
previously reported” with the following modification. HEK293T cells in 15-cm 
plates treated following normal PAR-CLIP procedures were lysed and digested 
with a combination of RNase I (Ambion, AM2295; 15 µl of a 1/50 dilution in H2O)
and Turbo DNase (2 µl) for 3 min at 37 °C, shaking at 1,100 r.p.m. The lysate was
then immediately cleared by spinning at 14,000 r.p.m., 4 °C for 30 min, and placed 
on ice for further use. HNRNPC-binding sites were identified by PARalyzer v.1.1 
(ref. 33) with default settings. 

PAR-CLIP-MeRIP experiment applied m6A-antibody immunoprecipitation
to the HNRNPC PAR-CLIP RNA samples. The HNRNPC PAR-CLIP RNA sample
was incubated with m6A-specific antibody (202003, SYSY), RNase inhibitor
(80 units, Sigma-Aldrich) and human placental RNase inhibitor (NEB) in 200 µl 1× IP
buffer (50 mM Tris-HCl pH 7.4, 750 mM NaCl and 0.5% (v/v) Igepal CA-630) at
4 °C for 2 h under gentle shaking conditions. For each PAR-CLIP-MeRIP experiment,
20 µl protein A beads (Invitrogen) were washed twice with 1 ml 1× IP buffer,
blocked by a 2 h incubation with 100 µl 1× IP buffer supplemented with bovine
serum albumin (BSA) (0.5 mg ml−1), RNasin and human placental RNase inhibitor,
and then washed twice with 100 µl 1× IP buffer. The pre-blocked protein A
beads were then combined with the prepared immuno-reaction mixture and incubated
at 4 °C for 2 h, followed by three washes with 100 µl 1× IP buffer. After that,
the RNA was eluted by 1 h incubation with 20 µl elution buffer (1× IP buffer and
6.7 mM m6A, Sigma-Aldrich) under gentle shaking conditions, and purified by ethanol
precipitation. The purified RNA sample (IP) as well as the input PAR-CLIP
RNA sample (input control) were used for library construction with the TruSeq Small
RNA Sample Preparation Kit (Illumina).

Libraries were prepared using the TruSeq Small RNA Sample Preparation Kit
(RS-200-0012, Illumina) according to the manufacturer’s instructions, and then sequenced
on an Illumina HiSeq 2000 with single-end 50-bp read length. The control and IP samples
from PAR-CLIP-MeRIP experiments (and likewise the control and knockdown
samples from METTL knockdown experiments) were sequenced together in
one flowcell on two lanes, and the reads from the two lanes of each sample were combined
for the remaining analysis. The raw sequencing data were trimmed using
Trimmomatic v.0.30 (ref. 35) to remove adaptor sequences, and
mapped to the human genome version hg19 by Bowtie 1.0.0 (ref. 36), allowing no
gaps and at most two mismatches.
Detection of PAR-CLIP-MeRIP peaks and differential PAR-CLIP peaks. The 
raw read counts of the biological replicates confirmed the reproducibility between 
replicates (Extended Data Fig. 9), and replicates were combined for subsequent 
analysis. For each genomic site, we calculated the average read counts within an 
11-nucleotide window centred at that site, as the normalized read counts for that site. 
This normalization smoothed the raw mapping curves, and facilitated identification 
of peaks within each mapping cluster. To correct for changes in sequencing depth 
or expression levels between samples, we then normalized the read counts at each 
genomic site to the total number of read counts on the respective gene. The above 
defined double-normalization procedures enabled precise identification of changes 
in the mapping reads at specific genomic locations by directly comparing the nor- 
malized read counts between samples. No read counts in the intergenic region were 
compared between samples, because the transcription boundaries are not defined 
at this region and the intergenic read counts cannot be normalized to correct changes 
for transcript expression. 
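
The double normalization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the input is taken to be a list of per-nucleotide raw read counts for one gene, and windows are simply truncated at the ends of the gene, a detail the text does not specify.

```python
from typing import List


def window_average(counts: List[float], window: int = 11) -> List[float]:
    """Average read counts in a window centred on each site.

    Assumption: windows are truncated at the ends of the gene.
    """
    half = window // 2
    smoothed = []
    for i in range(len(counts)):
        lo = max(0, i - half)
        hi = min(len(counts), i + half + 1)
        smoothed.append(sum(counts[lo:hi]) / (hi - lo))
    return smoothed


def double_normalize(counts: List[float], window: int = 11) -> List[float]:
    """Smooth per-site counts, then divide by the total read count on the gene."""
    total = sum(counts)
    if total == 0:
        # No reads on this gene: nothing to normalize.
        return [0.0] * len(counts)
    return [c / total for c in window_average(counts, window)]
```

Dividing by the gene-level total is what makes values comparable between samples that differ in sequencing depth or in expression of that gene.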

Detection of PAR-CLIP-MeRIP peaks involved comparing the read counts of
the IP sample with those of the control (Ctrl) sample as follows: (1) we identified all
peaks within HNRNPC-binding sites in the IP sample; (2) we performed
transcriptome-wide scanning to compare read counts of each identified peak in (1) with
read counts at the same genomic locations in the Ctrl sample to calculate the fold
change score, score = log2(H_IP/H_Ctrl). The score threshold was set to be 1,
corresponding to a twofold increase compared with control.

The detection of decreased HNRNPC-binding sites involved comparing HNRNPC
occupancies in the METTL knockdown (KD) sample with those in the control as
follows: (1) we identified all peaks within HNRNPC-binding sites in the METTL
knockdown sample; (2) we performed transcriptome-wide scanning to compare
read counts of each identified peak in (1) with read counts at the same genomic
locations in control to calculate the fold change score, score = log2(H_KD/H_Ctrl).
The score threshold was set to be −1, corresponding to a twofold decrease
compared with control.
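
The two fold-change scores above reduce to one log2 ratio with different thresholds. The following Python sketch illustrates this; the function names are hypothetical, the inputs are assumed to be normalized read counts that are strictly positive, and treating the thresholds as inclusive is an assumption.

```python
import math


def fold_change_score(h_sample: float, h_ctrl: float) -> float:
    """log2 ratio of normalized read counts (sample over control)."""
    if h_sample <= 0 or h_ctrl <= 0:
        raise ValueError("normalized read counts must be positive")
    return math.log2(h_sample / h_ctrl)


def is_merip_peak(h_ip: float, h_ctrl: float, threshold: float = 1.0) -> bool:
    """IP enrichment: at least a twofold increase over control (inclusive, by assumption)."""
    return fold_change_score(h_ip, h_ctrl) >= threshold


def is_decreased_site(h_kd: float, h_ctrl: float, threshold: float = -1.0) -> bool:
    """Knockdown loss: at least a twofold decrease relative to control."""
    return fold_change_score(h_kd, h_ctrl) <= threshold
```

For example, `is_merip_peak(4.0, 2.0)` is true because the ratio is exactly twofold, whereas `is_decreased_site(1.5, 2.0)` is false.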

Identification of enriched motifs and HNRNPC m6A-switches. To identify
enriched motifs, we first sorted the 12,998 HNRNPC PAR-CLIP-MeRIP peaks (with
IP/input enrichment ≥ 2) by the T-to-C mutation frequency. We then chose the top
4,500 peaks with the highest T-to-C mutation frequency for motif analysis using
FIRE with default RNA analysis parameters. The top two enriched motifs are the
GRACH and the U-tract motif. We also used the top 1,024 and 2,048 peaks for
motif analysis, yielding the same motif results as the top 4,500 peaks.

To identify transcriptome-wide HNRNPC m6A-switches, we first searched for
all coupling events within 50 nucleotides between U5 and RRACH motifs, with the
U5 motif located within HNRNPC-binding sites. For PAR-CLIP-MeRIP samples,
the fold-change score E at the RRACH motif was calculated for each coupling event.
Also, the P value for each coupling event was calculated as described. Then, we
generated the z value, z = E·(−log10 P), as one comprehensive parameter to pick
meaningful genomic loci. HNRNPC m6A-switches identified from PAR-CLIP-MeRIP
experiments should fulfil the following requirements: (1) read counts at both
the control and IP sample ≥ 5; (2) z value ≥ 0.627, corresponding to FDR = 5%.
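
As a minimal sketch of the filter just described, the combined score and the two requirements can be written as follows. The function and parameter names are hypothetical, and the P value is assumed to have been computed elsewhere and to be strictly greater than zero.

```python
import math


def z_value(fold_change_e: float, p_value: float) -> float:
    """Combined score z = E * (-log10 P); p_value must be > 0."""
    return fold_change_e * -math.log10(p_value)


def passes_switch_filter(
    ctrl_reads: float,
    ip_reads: float,
    fold_change_e: float,
    p_value: float,
    min_reads: float = 5,
    z_cutoff: float = 0.627,
) -> bool:
    """Apply the read-count requirement, then the z-value cutoff."""
    if ctrl_reads < min_reads or ip_reads < min_reads:
        return False
    return z_value(fold_change_e, p_value) >= z_cutoff
```

Multiplying the fold change by −log10 P means that a coupling event passes only when it is both strongly enriched and statistically well supported.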

For METTL knockdown samples, the fold-change score at the U-tract motifs
was calculated for each coupling event. HNRNPC m6A-switches identified from
METTL3/L14 knockdown samples should fulfil the following requirements: (1) read
counts at both the control and knockdown sample ≥ 5; (2) z value ≥ 0.627,
corresponding to FDR = 5%.

Distribution of HNRNPC m6A-switches. Pie charts illustrating the distribution
within each segment were made using the following hierarchy: intron > ncRNA >
3' UTR > 5' UTR > CDS > intergenic. To plot the distribution of HNRNPC
m6A-switches in their respective localized segments (such as intron, exon, 3' UTR, CDS,
5' UTR), we first identified the distance between each m6A-switch and the 5' end
of the respective segment. This distance was then divided by the length of that
segment to determine a percentile where this m6A-switch fell, and then this specific
percentile bin was incremented. Following this approach, we obtained the
distribution pattern of all m6A-switches within each segment.
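
The percentile binning above can be illustrated with a short Python sketch. The function name and the input layout are hypothetical: each switch is given as a tuple of its position and the start and end of its segment, with coordinates assumed to be already oriented 5' to 3' and the end exclusive.

```python
from typing import List, Tuple


def relative_position_histogram(
    switches: List[Tuple[int, int, int]], n_bins: int = 100
) -> List[int]:
    """Count switches per relative-position bin within their segment.

    Each tuple is (switch_pos, segment_start, segment_end).
    """
    bins = [0] * n_bins
    for pos, start, end in switches:
        length = end - start
        if length <= 0 or not (start <= pos < end):
            continue  # skip malformed or out-of-segment entries
        fraction = (pos - start) / length  # distance from 5' end / segment length
        bins[min(int(fraction * n_bins), n_bins - 1)] += 1
    return bins
```

Because positions are expressed as a fraction of segment length, segments of very different sizes contribute to the same set of bins.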

RNA-seq. RNA-seq experiments were performed on two replicate RNA samples 
from HNRNPC, METTL3, METTL14 knockdown as well as control HEK293T 
cells (48 h after transfection). Total RNA samples were extracted according to the 
RNeasy Plus Kit (catalogue no. 74104, Qiagen). Libraries were prepared according 
to the TruSeq Stranded mRNA LT Sample Prep Kit (catalogue no. RS-122-9005DOC). 
Knockdown and control samples were sequenced together in one flowcell on four 
lanes, respectively. All samples were sequenced on an Illumina HiSeq 2000 with paired-end
100-bp read length. The reads from the four lanes of each sample were combined
for all analyses. The RNA-seq data were mapped using the splice-aware
alignment algorithm TopHat v.1.1.4 (ref. 40) based on the following parameters:
tophat --num-threads 8 --mate-inner-dist 200 --solexa-quals --min-isoform-fraction
0 --coverage-search --segment-mismatches 1. Gene expression level changes were
analysed using Cuffdiff. Differential splicing was determined using DEXSeq based
on Cufflinks-predicted, non-overlapping exons. To compare with a different mRNA-binding
protein, the RNA-seq data from HNRNPU knockdown HEK293T cells
(GEO34995 data set) were analysed.

GO, evolutionary conservation, graphic and statistical analyses. GO enrichment
analysis was applied on the co-regulated high-confidence m6A-switch-containing
genes, against all high-confidence m6A-switch-containing genes as background,
using GOrilla.

Phylogenetic conservation analysis was performed by comparing PhyloP scores
at the U-tract motif and RRACH motif for HNRNPC m6A-switches to those of
randomly selected sequences. The PhyloP scores were accessed from the precompiled
PhyloP scores (ftp://hgdownload.soe.ucsc.edu/goldenPath/hg19/phyloP46way/)
under both primate and vertebrate categories. P values were evaluated using the
Mann-Whitney-Wilcoxon test, ***P < 10^−16. For the U-tract motifs, we collected
all U-tracts (5× Us) across all chromosomes and randomly selected 10,000 sites
among the 38,561,577 sites of our census. The random selection was done
separately for primates and for vertebrates. For the RRACH motif, we also collected all
RRACH sites across all chromosomes and randomly selected 10,000 sites among
the 78,815,225 sites of our census. Here, too, the random selection was done
separately for primates and vertebrates.
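
The background sampling and rank comparison described above can be sketched as follows. This is an illustration only, not the analysis code: the function names and the fixed seed are assumptions, and only the Mann-Whitney U statistic is computed here (the P value would come from a statistics package such as R, which was used for the statistical analyses).

```python
import random
from typing import List, Sequence


def sample_background(scores: Sequence[float], k: int = 10_000, seed: int = 0) -> List[float]:
    """Randomly select k background scores without replacement (seed is an assumption)."""
    rng = random.Random(seed)
    return rng.sample(list(scores), min(k, len(scores)))


def mann_whitney_u(x: Sequence[float], y: Sequence[float]) -> float:
    """U statistic for x versus y, using average ranks for tied values."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n = len(x)
    return rank_sum_x - n * (n + 1) / 2
```

Here `x` would hold the PhyloP scores at m6A-switch motifs and `y` the sampled background scores; a U close to `len(x) * len(y)` indicates that the switch sites tend to score higher.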

Sequence logos were generated using the WebLogo package. The R statistical 
package was used for all statistical analyses (unless stated otherwise). 

Cell proliferation analysis. HEK293T cells were transfected with si-control,
si-HNRNPC, si-METTL3 and si-METTL14 RNAs. After transfection, the numbers of
cells were counted at 0, 24, 48 and 72 h, as described previously. Three independent
experiments were performed and growth curves were plotted to test the effects on 
cell proliferation. 

RT-PCR quantitation. Total RNA samples were extracted from HEK293T cells
and reverse transcribed using SuperScript III First-Strand Synthesis System (Life
Technologies, catalogue no. 18080-051). In order to validate the splicing changes
identified from our RNA-seq data, we performed RT-PCR measurements using
Thermo Scientific Taq DNA Polymerase under the following conditions: 95 °C for
3 min, 30 cycles of 95 °C for 30 s, 55 °C for 30 s, 72 °C for 1 min, and then finally 72 °C
for 10 min. For the target alternate exon, we designed and used primers annealing
to both neighbouring constitutive exons. The PCR products were separated on 1.2%
agarose gel and ethidium bromide stained. In order to validate the gene expression
level changes identified from our RNA-seq data, we performed qRT-PCR measurements
using Power SYBR Green PCR Master Mix (Life Technologies, catalogue
no. 4367659) under the following conditions: 50 °C for 3 min followed by 95 °C for
10 min, 40 cycles of 95 °C for 15 s, 60 °C for 1 min, and then 40 °C for 1 min and
95 °C for 15 s and finally 60 °C for 30 s.

The primer sequences are as follows (listed as gene name: forward primer; reverse
primer).
ANAPC1: TGCCAAAAGAAATAGCAGTTCAG; TGCCAAAAGAAATAGCAGTTCAG
ANLN: GCCAGGCGAGAGAATCTTCA; GGCTGCTGGTTACTTGCTTC
SRSF6: ACAAGGAACGAACAAATGAGGG; GCTTCCAGAGTAAGATCGCCTAT
E2F8: ACCCAAGCTCAGCCATTGTA; GAGTCATAGTTGGTGGCCCT
HIPK1: CCAGTCAGCTTTGTACCCATC; TTGAAACGCAGGTGGACATA
DNAJA3: CCCTTTCATTTGTACTGCCTCC; TGATCTCTTTCTGGCTGGCA
STAMBP: GTTCTCATCCCCAAGCAAAG; ATCCAGCCCAGTGTGATGA
ARHGAP5: GCGGATTCCATTTGACCTCC; GCTGCCCTGGTGAAATGAAT
ROBO1: TTTGGGCTTCTGCGTAGTTT; GGAGGGTACTGGAGACAGCA
SRPK1: CCCTGAGAAGAGAGCCACTG; ACCCTGAAAAGGGAAGAGGA
CENPK: AAGGCTAAAAATTCACAAAGCA; TCCATATCTTTCCACATTTCTTCA
BCLAF1: TCCTGAAAGGTCTGGGTCTG; TCCTGAAAGGTCTGGGTCTG
SUDS3: T@€CCTGGGGTTCTGTATTTC; CAGTTCAAGCGAGGGAAGTC
DYRK1A: CTTCAGCATGCAAACCTTCA; GGCAGAAACCTGTTGGTCAC
SMEK1: TTGAAGGACTGCACCACTTG; CCTGTGTTTTCGTGGTTGTG
ATP6V1A: AAGCATTTCCCCTCTGTCAA; CTGCCAGGTCTTCTTCTTCC
KPNA6: CCCTGTGTTGATCGAAATCC; GATCTGCTCAGGGGTTCCTC
TBC1D23: GGTGAATCTCCTAATGGCTCA; CGATCCACAGGAGTTGATGT
GPBP1: CGTCATTGAATTTTGAGAAGCA; TTAGGACGCCCAATAGCAGA
MTF2: GTCTGCATTTGGTTCCTGGT; CTGCAGGAAAGGCAACCTTA
ATP6V0A1: TCCGTGTCTGGTTCATCAAA; TCTGAGTGCAAACTGGATGG
MAP4K3: TCTTCATACCACAGGAAATGC; AACAGGTTTGTGTGGGGGTA
SUMO2: TTCTTTCATTTCCCCCTTCC; TATTTTTCCCCATCCCGTCT
MAP3K3: CAGTTCCTCTCCCCACTCTG; GACAGAGAGGTGCCTGCTTC
CD82: CGATTTTCCCAGGATGACAG; GAAAGGGCCCTATTGAGGAC
YTHDF2: ACTTGAGTCCACAGGCAAGG; AAGCAGCTTCACCCAAAGAA
Extended Data Figure 1 | m6A increases the accessibility of the U-tract to
enhance HNRNPC binding. a, Secondary structure of the MALAT1 hairpin
with m6A methylation at the 2,577 site shown in red. Nucleotide position
numbers correspond to their locations along the human MALAT1 transcript
(NCBI accession NR_002819). b, RNA pull down showing that HNRNPC
preferably binds methylated RNA. c, The list of proteins with identified
peptides by mass spectrometry in b. d, Recombinant HNRNPC1 binds more
strongly with the MALAT1 2,577-m6A hairpin compared with the
unmethylated hairpin, as determined by an in vitro ultraviolet crosslinking
assay. e, HNRNPC shows binding around the 2,577-A site along MALAT1
in vivo, as determined by previously published HNRNPC iCLIP data. The
underlying genomic sequence is shown at the bottom with a red square
marking the 2,577-m6A site. The slight shift of the iCLIP signal to upstream of
the U-tract-binding site is probably due to the steric hindrance of the peptide
fragment remaining on RNA, which can cause reverse transcription to
terminate more than one nucleotide upstream of the crosslink site.
f, Quantification of the RNase V1 cleavage signal for the U-tract region from the
RNA structural mapping assay in Fig. 1e. To correct for sample loading
difference, each band signal was normalized to the band signal of the immediate
3' residue to the U-tract. Data are mean ± s.d.; n = 3, technical replicates.
g, Quantitative analysis of the RNase T1 cleavage signal from the RNA
structural mapping assay in Fig. 1e. An increased RNase T1 cleavage signal
(single-strand specific and cleavage after guanosines) was observed due to the
surrounding m6A residue. To correct for sample loading difference, the ratio for
each band signal among all bands in each lane was calculated. Relative T1
cleavage = (m6A_native/m6A_denature)/(A_native/A_denature); n = 2, technical
replicates. h, Quantitative CMCT mapping showing increased signals for the
U-tract bases around the U base-pairing with m6A. Quantitation of band
signals within the U-tract region is shown on the right. Data are mean ± s.d.;
n = 4, technical replicates.
Extended Data Figure 2 | Increased accessibility of U-tracts enhances
HNRNPC binding. a, Structure probing of the 2,577-A-to-U mutated
MALAT1 hairpin (2,577-U). The annotation is the same as in Fig. 1d.
b, Quantification of the RNase V1 cleavage signal for the U-tract region from
RNA structural mapping assays as in a. To correct for sample loading
difference, each band signal was normalized to the band signal of the 3'-most U
of the U-tract. n = 2, technical replicates. c, Filter-binding curves displaying
the binding affinities between recombinant HNRNPC1 and 2,577-U/A
oligonucleotides. Data are mean ± s.d.; n = 3, technical replicates.
d, Filter-binding results showing the binding affinities between recombinant
HNRNPC1 and four mutated MALAT1 oligonucleotides. (1) Mutate G-C to
C-C, 2,577-A: predicted to weaken the hairpin stem and increase HNRNPC
binding. Result: binding improved from a Kd of 722 nM to 142 nM (fivefold).
(2) Mutate G-C to C-C, 2,577-m6A: in this context of weaker stem, m6A is
predicted to confer a smaller effect compared to wild-type hairpin. Result:
improved binding only twofold instead of eightfold. (3) Restore C-C to C-G,
2,577-A: predicted to restore the hairpin stem and decrease HNRNPC binding
compared to C-C mutant. Result: binding decreased by 6.4-fold. (4) Restore
C-C to C-G, 2,577-m6A: in this context of restored stem, m6A is again
predicted to confer increased binding compared to 2,577-A hairpin. Result:
improved binding by 2.5-fold. Data are mean ± s.d.; n = 3 each, technical
replicates. e, RNA alkaline hydrolysis terminal truncation assay showing
recombinant HNRNPC1 binding to terminal truncated MALAT1 hairpin
oligonucleotides (2,577 site m6A methylated or unmethylated). In this assay,
3'-radiolabelled MALAT1 2,577 hairpin oligonucleotides were terminal
truncated by alkaline hydrolysis into RNA fragments that were then incubated
with HNRNPC1 protein followed by filter binding wash steps. The remaining
RNA on the filter paper was isolated and analysed by denaturing gel
electrophoresis, as indicated in the lane ‘C1-bound’ or ‘C1-B’. ‘Input’ refers to
alkaline-hydrolysis-truncated RNA oligonucleotides used for incubation with
HNRNPC1; ‘G-L’ or ‘G-ladder’ was generated from RNase T1 digestion; ‘Ctrl’
refers to the intact MALAT1 hairpin without alkaline hydrolysis truncation.
One pair of methylated/unmethylated truncated oligonucleotides (CUT1,
marked by green arrows) was selected for subsequent biochemical analysis,
due to their strong interaction with HNRNPC1. f, RNA terminal truncation
assay as in e except 5' 32P-labelled oligonucleotides were used. One pair of
methylated/unmethylated truncated oligonucleotides (CUT2, marked by green
arrows) was selected for subsequent biochemical analysis. g, Structure probing
of the CUT1 oligonucleotides using RNase V1 and nuclease S1 digestion.
Annotation is the same as in Fig. 1e. The red dot marks the m6A site and
the red line marks the U-tract region. h, Structure probing of the CUT2
oligonucleotides using RNase V1 and nuclease S1 digestion. Annotation is
the same as in g. i, Truncated oligonucleotides with exposed U-tracts
increased HNRNPC binding regardless of m6A. Data are mean ± s.d.;
n = 3, technical replicates.
immunoprecipitation; nt, nucleotide; UV, ultraviolet. The RNase T1 used in
our 2dTLC assay cleaves single-stranded RNA after guanosines, so the m6A/A
ratio determined here represents the m6A fraction of all adenosines
following guanosines. b, Analysis of crosslinked RNA-HNRNPC complexes
(RNA samples within RNA-HNRNPC crosslinked complexes) were extracted
from the gel slices marked by the red rectangle. c, Denaturing gel analysing the
size distribution for the HNRNPC PAR-CLIP RNA samples (lane 2). The
RNA size standards were loaded in lanes 1 and 3.
Extended Data Figure 4 | PAR-CLIP-MeRIP identifies transcriptome-wide
m6A-switches in the vicinity of HNRNPC-binding sites. a, Density plots
illustrating the distribution of distance between the PAR-CLIP-MeRIP/input
peaks and the nearest GRACH motif (top) or the nearest U-tracts (bottom).
b, Definition and identification of HNRNPC m6A-switches based on the
PAR-CLIP-MeRIP analysis. Approximately 89% of PAR-CLIP-MeRIP peaks
harbouring both the U-tract and RRACH motifs have an RRACH-U-tract
inter-motif distance within 50 nucleotides, significantly higher than the 64% of
such coupling within the genomes. HNRNPC m6A-switches are identified
as m6A-methylated RRACH-U-tract coupling events. c, Volcano plot
depicting all coupling events (open circles) as defined in b, according to their
P values (P; y axis) and fold-change values at RRACH sites (E; x axis). To
identify HNRNPC m6A-switches, we generated the z value, z = E·(−log10 P),
as one comprehensive parameter to pick meaningful genomic loci. HNRNPC
m6A-switches identified from PAR-CLIP-MeRIP experiments should fulfil
the following requirements: (1) read counts at both the control and IP
sample ≥ 5; (2) z value ≥ 0.627, corresponding to FDR = 5%. d, Pie chart
depicting the region distribution of HNRNPC m6A-switches identified by
PAR-CLIP-MeRIP. e, Pie chart depicting HNRNPC PAR-CLIP peaks. These
are enriched in introns, consistent with previous reports that HNRNPC binds
mainly nascent transcripts.
Extended Data Figure 5 | Validation of two identified m6A-switches.
a, b, PAR-CLIP-MeRIP data detected positive IP/input enrichment at the
RRACH sites (red arrowheads) on the DNAJC25-GNG10 gene (a) and
HNRNPH1 gene (b) in HEK293T cells. c, d, Quantification of RNase V1
cleavage signals around the U-tract region of m6A-switches on the
DNAJC25-GNG10 (c) and HNRNPH1 (d) transcript, related to Fig. 2g, h. Data are
mean ± s.d.; n = 3, technical replicates each. e, Quantitative CMCT mapping of
DNAJC25-GNG10 m6A-switch shows increased band signals around the
uridine base that pairs with m6A. The red vertical line marks the U-tract region.
Quantitation of band signal for the U-tract region is shown on the right. Data
are mean ± s.d.; n = 3, technical replicates. The HNRNPH1 m6A-switch
hairpin is not suitable for CMCT probing, because its reverse transcription
binding primer region is too short. f, g, In vivo DMS mapping of the
DNAJC25-GNG10 hairpin (f) and HNRNPH1 (g); data are from ref. 7. A and C residues are
marked with orange dots and the m6A residue is marked with a red dot.
The hairpin loops are indicated by red bars. h, Transcriptome-wide S1/V1
mapping around the HNRNPH1 m6A-switch site. Blue bars represent V1
signal; magenta bars represent S1 signal. The hairpin loop is indicated by a red
bar; data are from ref. 4. Not enough reads could be collected to make a plot for
the DNAJC25-GNG10 m6A-switch region.
Extended Data Figure 6 | Molecular features of high-confidence m6A-switches.
a, Western blot (WB) showing stable HNRNPC protein abundance
upon METTL3/L14 knockdown. b, Volcano plot of the METTL3/L14
knockdown (KD) data depicting RRACH-U-tract coupling events (open red
circles) as defined in Extended Data Fig. 4b, according to their P values
(P; y axis) and fold-change values at the U-tracts (E; x axis). c, Overlap of
RRACH-U-tract coupling events with decreased HNRNPC binding by
METTL3 and METTL14 knockdown. d, The intron fraction of HCS m6A-switches
in coding RNA and non-coding RNA. e, Density plot displaying the
distribution of exonic m6A-switches/HNRNPC PAR-CLIP peaks according to
exon length. f, Inter-motif (RRACH-U-tract) distance distributions suggest
that m6A-switches have a preference for shorter distances between the RRACH
and U-tract (≥5× U) motifs. The distribution curves are from PAR-CLIP-MeRIP
data (green), METTL3/L14 knockdown (red) and high-confidence
(HCS) m6A-switches (black). g, Analysis of the inter-motif (U-tract-U-tract)
distance patterns, previously identified by iCLIP, in PAR-CLIP-MeRIP,
METTL3/L14 knockdown and high-confidence m6A-switch data. The peaks at
~165 and ~300 nucleotides are clearly present. For the 2,798 high-confidence
switches, we analysed those in which the other U-tract motif is also in a
PAR-CLIP-identified sequence; the long-range peaks seem to have shifted to
longer distances (~220 and ~370 nucleotides). h, METTL3/L14 knockdown
does not affect the inter-motif (U-tract-U-tract) distance distributions for
U-tracts (≥5× U) in HEK293T cells. i, EVOfold analysis for the 2,798
high-confidence m6A-switches. The chances for high-confidence m6A-switches
to have EVOfold records are significantly higher than for random genomic
sequences. We first calculated the number of high-confidence sites expected in
the EVOfold database by chance to be ~1.7. We found that 18 high-confidence
sites are present in the EVOfold database, resulting in ~11× enrichment. This
result is further divided into intronic and exonic regions.
Extended Data Figure 7 | m6A-switches regulate the abundance of target
mRNAs. a, HNRNPC, METTL3/L14 knockdown confirmed by western blots.
b, HNRNPC knockdown (KD) and METTL3/L14 knockdown co-regulated the
expression of a large number of genes. Gene expression changes between
control (Ctrl) and HNRNPC, HNRNPU, METTL3/L14 knockdown HEK293T
cells were analysed by Cuffdiff2 (refs 38, 39), and the absolute numbers of
differentially expressed genes are shown. ‘HCS-containing genes’ refers to the
1,815 genes containing high-confidence m6A-switches. The RNA-seq data
from HNRNPU knockdown HEK293T cells (Gene Expression Omnibus
accession GEO34995 data set) were analysed for comparison with a different
mRNA-binding protein. HNRNPU did not show preferential interaction
with the 2,577-m6A modified MALAT1 hairpin (Fig. 1b, c). c, GO analysis
of the m6A-switch-containing genes whose expression levels were
co-differentially regulated by HNRNPC and METTL3/L14 knockdown, against
all m6A-switch-containing genes as background. d, An example of an
m6A-switch among co-regulated transcripts is the ARHGAP5 transcript (NCBI
accession NM_001030055). Its proposed secondary structure is shown with the
m6A methylation site in red, opposing the U-tract in a stem.
e, f, PAR-CLIP-MeRIP detected positive IP/input enrichment at the RRACH
site (red arrowhead) of the ARHGAP5 m6A-switch (e), while METTL3/L14
knockdown decreased HNRNPC binding at the U-tract (red square) of this
m6A-switch (f). g, The expression level of the ARHGAP5 gene was
co-upregulated by HNRNPC, METTL3/L14 knockdown, as shown by the
RNA-seq data from HEK293T cells. The vertical black line represents the
m6A-switch site. h, HNRNPC, METTL3/L14 knockdown decreased the
proliferation rates of HEK293T cells to a similar extent. Data are mean ± s.d.;
n = 4, biological replicates.
Extended Data Figure 8 | m6A-switches regulate alternative splicing of
target mRNAs. a, Fold changes (knockdown (KD)/control (Ctrl), log2) in
normalized exon expression against RNA-seq reads detect the exons in
HNRNPC knockdown, METTL3 knockdown, METTL14 knockdown and
control samples. Statistically significant differentially expressed exons
(SSDEEs) called by DEXSeq are indicated in red. b, Proposed secondary
structure of the CDS2 hairpin with the m6A methylation site shown in red,
opposing the U-tract region. Nucleotide position numbers correspond to their
locations along the human CD82 transcript (NCBI accession NM_003818).
c, Proposed secondary structure of the YTHDF2 hairpin with the m6A
methylation site shown in red, opposing the U-tract region. Nucleotide position
numbers correspond to their locations along the human YTHDF2 transcript
(NM_001173128). d, e, PAR-CLIP-MeRIP detected a positive enrichment at
the RRACH site (red arrowhead) (d), while METTL3/L14 knockdown
decreased HNRNPC binding at the U-tract (red square) of this YTHDF2
m6A-switch (e). f, The inclusion level of one YTHDF2 exon is
co-downregulated by HNRNPC knockdown, METTL3 knockdown and METTL14
knockdown, as validated by RT-PCR. Data are mean ± s.d.; n = 3, biological
replicates. g, We analysed our polyA+ RNA-seq data to look for reads that
span intron/exon junctions on CDS m6A-switch containing genes. We find that
the control sample has significantly more reads spanning intron/exon
junctions than HNRNPC and METTL3/L14 knockdown samples. This result
indicates that m6A depletion at the CDS m6A-switches promotes intron
exclusion.
Extended Data Figure 9 | Summary of the sequencing samples. a, For
PAR-CLIP-MeRIP and PAR-CLIP experiments from HEK293T cells, the
number of mapped reads and ‘T-to-C’ mutation rates are given for each
replicate. b, For RNA-seq experiments from HEK293T cells, the number of
total reads, the number of mapped reads as well as the mapping rates are given for
each replicate. c, Scatter plots comparing transcripts for all PAR-CLIP replicate
experiments. The square of Spearman’s rank correlation value (r2) for each
pair is shown in the top left corner of the respective panel. d, The detected
expression level changes show a strong correlation between gene knockdown
replicates. Scatter plots comparing the fold changes (log2) in normalized
gene expression from replicates of HNRNPC, METTL3 and METTL14
knockdown. The square of Spearman’s rank correlation value (r2) for each pair
is shown in the top left corner of the respective panel.
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| ENVIRONMENTAL TECHNOLOGY 


Green light 


The scientific design of low-energy sustainable buildings 
is moving into the mainstream. 


BY BRYN NELSON 


Before her ‘aha’ career moment, Yetunde
Abdul studied cancer in labs in Berlin
and in Ravenna, Italy. But ambivalent
about a future stuck at the bench, she found a
new direction while doing a master’s degree in 
environmental technology at Imperial College 
in London. 

There she created a set of criteria and a
scoring index to assess the sustainability of
buildings in urban areas, and learned of a
UK-based international environmental rating
system for buildings. The scheme, which
certifies buildings that meet standards for
environmentally friendly design, construction
and operation, immediately piqued her interest.

Today, Abdul is involved in the economic 
and humanitarian aspects of ‘green’ building,
and has never thought longingly of the lab. “As 
a career option, it’s looking a lot better than it 
did when I started out,” she says of the boom- 
ing green-building field, in which she works 
as a principal consultant and project manager 
at the rating system’s parent organization, the 
Building Research Establishment in Watford, 
UK. “It’s a lot more buoyant.” 

Green construction, also called green or
sustainable building, aims to reduce a structure’s
overall environmental impact by applying 
principles that govern features such as its loca- 
tion, size, design, construction, maintenance 
and energy needs. Those working in the field 
could, for example, be involved in assessing the 
sustainability of building materials, designing 
windows that maximize daylight, evaluating
energy-use patterns or understanding
how occupants might interact with redesigned 
homes, offices and schools. Green building also 
incorporates life-cycle analysis, which evaluates 
a building component's lifetime environmental 
impact on the basis of its manufacture, trans- 
port, installation and disposal or reuse. 

In the United States and the European Union 
(EU), buildings account for 40% of all energy 
use, a level that has made them increasingly 
attractive targets for energy and emission- 
reduction goals and certification schemes that 
reward more-efficient structures. That height- 
ened focus, in turn, has given young researchers 
ample new opportunities to land positions in 
non-profit organizations and industry aimed at 
greening new and existing buildings around the 
world. Graduate students and postdoctoral fel- 
lows are making the transition from disciplines 
as varied as biochemistry, toxicology, geogra- 
phy, physics and environmental engineering. 

Although architectural training is not 
strictly required for most positions, a famili- 
arity with sustainability and building science 
often is. More universities are offering courses 
in building science or building physics — areas 
that take a research-oriented and hands-on 
approach to the physical attributes of build- 
ings. Portland State University in Oregon, 
for example, offers both undergraduate and 
graduate degrees in mechanical engineering or 
architecture with an emphasis on building sci- 
ence. Students who complete the programme 
have been very successful at finding jobs in the 
industry, says David Sailor, director of the uni- 
versity’s Green Building Research Laboratory. 

Volunteering with green-building-related 
non-profit organizations can also help early- 
career scientists to make connections and 
break into the field. But most experts in green 
building agree that internships and fellowships 
are often the most direct path to a position. 


INTERNATIONAL OPPORTUNITIES 

Job prospects vary by country, but most 
forecasts suggest strong growth internationally 
throughout the green-building sector. Business- 
management consultants Navigant Consulting 
in Chicago, Illinois, recently predicted that 
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the European market for energy-efficient
buildings, including products and services, 
would grow from €41.4 billion (US$47 billion) 
in 2014 to €80.8 billion in 2023. The US market 
has likewise seen a rapid growth in the adop- 
tion of green buildings that is widely expected 
to continue. China, with around 2 billion square 
metres of new construction every year, is the 
world’s largest commercial building market 
and an increasingly attractive target for research 
programmes in green building. In 2009, the 
United States and China established the Clean 
Energy Research Center, and one aim of the 
centre’s Building Energy Efficiency Consor- 
tium is to speed the research and development 
of energy-saving technologies by testing them 
in demonstration buildings throughout China. 

In academia, US funding for green-building 
research has increased in recent years, with the 
money spread among multiple agencies such 
as the Department of Energy, the National 
Science Foundation and the Environmental 
Protection Agency. 

In the EU, the €80-billion Horizon 2020 
programme for research and innovation 
(covering the years 2014-20) includes green- 
building-related research in its Climate Action, 
Environment, Resource Efficiency and Raw 
Materials challenge and other programmes. 
Researchers say that this funding commitment
has boosted the prospects for academics
throughout the EU. “If you look at
the research opportunities in the green-building
field, job opportunities are good in all EU countries,”
says Michael Krause, group manager for
building technologies at the Fraunhofer Insti- 
tute for Building Physics in Kassel, Germany. 

Krause, who studied physics before moving 
from basic research in renewable energy to 
applied work on energy efficiency within the 
building sector, joined his institute colleagues 
in planning the energy-efficiency scheme for 
Munich's NuOffice I, one of the most lauded 
green buildings in the world. At the time of 
its 2013 certification, the office tower earned 
the highest score ever awarded for a building 
of its type by the Leadership in Energy and 
Environmental Design rating system for green 
buildings, which is recognized in more than 
140 nations. It sports a rooftop solar array to 
produce much of its own energy, automated 
window shades to prevent overheating and 
a groundwater cooling system in lieu of an 
energy-hogging air conditioner. As part of the 
project, the Fraunhofer team calculated how 
to increase the building’s efficiency through 
features such as a thick layer of insulation and 
triple-paned windows. 

Energy-efficiency consulting is another 
growing employment opportunity for research- 
ers. Last May, Sailor got a call from SBW Con- 
sulting in Bellevue, Washington, which helps 
home and business owners to measure energy
and water efficiency and was looking to recruit
people. He recommended Santiago Rodriguez, 
his lab manager and a specialist in development 
and maintenance of building instrumentation. 
Rodriguez, who had initially been drawn to the 
mathematics of thermal dynamics and fluid 
mechanics, had used his sensor-programming 
savvy to land a position in Sailor’s lab. Among 
other projects there, he developed and deployed 
sensors to evaluate how a green roof atop a retail 
store interacted with the building’s envelope — 
the physical barriers between its interior and 
exterior — and with its heating, ventilation and 
air-conditioning system. 

After completing his master’s degree in 
mechanical engineering, Rodriguez joined 
SBW in July 2014 as an energy-efficiency engi- 
neer. He now installs and maintains sophisti- 
cated sensors that help clients to reduce and 
track energy consumption. “I like evaluating 
energy models that other people have developed
and building my own energy models,” he
says. “And the technical aspects of instrumen- 
tation I find fascinating.” 

But green building is not limited to technol- 
ogy and engineering. Abdul, the former cancer 
researcher, recently helped to create a tool that 
aids charities such as the International Federa- 
tion of Red Cross and Red Crescent Societies to 
assess the sustainability of their reconstruction 
projects after natural disasters. As part of an 
education effort for professionals and volunteers 
in the humanitarian field, she went to the Philip- 
pines to teach volunteers and professionals to 
use the tool in reconstruction programmes in 
an area devastated by Typhoon Haiyan. Course 
participants said that the tool would help them 
to make more-informed decisions. 


INDIRECT PATHS 

Lindsay Baker, vice-president of business devel- 
opment at start-up company Building Robotics, 
based in Oakland, California, focuses on how 
people interact with the environment within 
green buildings. While majoring in environ- 
mental studies, she completed three intern- 
ships that introduced her to the green-building 
field, and after completing her undergraduate 
degree, she helped to develop the LEED (Lead- 
ership in Energy and Environmental Design) 
rating system at the non-profit US Green Build- 
ing Council in Washington DC. 

Now a doctoral student in the building- 
science programme at the University of Cali- 
fornia, Berkeley, she is helping the company 
to promote a proprietary software system that 
plugs into a building’s digital heating and cool- 
ing system and lets occupants act as sensors 
to fine-tune the indoor environment. Baker 
expects the start-up, which employs a dozen 
people, to add staff by the end of 2015. 

Chris Pyke, vice-president of research at the 
US Green Building Council, says that the coun- 
cil finds some of its top job candidates through 
internships. The best ones, he says, are flexible, 
analytical, curious and able to multitask. 
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David Sailor checks a weather station in a green- 
roof study on the Portland State University campus. 


Pyke, who is also the chief operating officer 
for the Global Real Estate Sustainability Bench- 
mark, which helps potential investors to com- 
pare the green attributes of global real-estate 
portfolios, originally honed his analytical 
expertise on a study of the effects of urbani- 
zation on a tropical forest. Although his past 
research might seem a world apart from his 
current role, he says that they use similar tools. 
“Whether you’re looking at a bioclimatic analysis
of a community of trees in the forest or a
community of real-estate funds, mathematically
they’re not that different,” he says.

His move into green building, with interven- 
ing stints at non-profit organizations, the US 
Environmental Protection Agency and a pri- 
vate consulting firm, has allowed him to direct 
his skills towards helping people make smarter 
decisions about the urban environment. 

Ellen Quinn followed a similarly indirect 
route into the sector from mining geology. 
Now vice-president of environment, health 
and safety at UTC Building & Industrial Sys- 
tems, part of United Technologies Corpora- 
tion in Hartford, Connecticut, she focuses 
on proactive solutions to reducing the com- 
pany’s environmental footprint. Quinn moni- 
tors metrics that tally the energy, water, waste 
and efficiency of the company’s factories and 
research and development centres. Every year, 
she says, her team sets environmental improve- 
ment targets for each building and develops a 
customized plan to track its progress. “Green 
buildings are good business,” she says. 

Increasingly, they also lead to a broad range 
of employment opportunities, says Jelena 
Srebric, a green-building expert and mechani- 
cal engineer at the University of Maryland in 
College Park. “You can come from an unprecedented
number of fields in science, engineering
or technology and make a contribution.”


Bryn Nelson is a freelance writer based in 
Seattle, Washington. 
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BFF’S FIRST ADVENTURE 


BY VERNOR VINGE 
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Whoever kidnapped me has erased my most 
recent memories and gagged me. I have no 
idea how they snatched me from Timothy 
Bennett, or if Tim is okay — but now I’m
hurtling towards a concrete wall. The kidnappers must have panicked
and thrown me from their car. A transforming phone would become a
parachute, but I can’t transform; physically, I’m a clunky classic.
I extend a flange and rudder myself around so I’ll hit the wall on
my reinforced corner. I safe my MEMS and shut down.

... Rebooting. I’m lying beneath 
a freeway overpass. There’s a crack in 
my shell and my GPS is busted, but I’ve 
escaped! All my outgoing wireless chan- 
nels remain blocked, but surely the freeway 
noticed me? Alas this is California, what
other states call ‘the land that time forgot’.
Here the sensors politely ignore data that
might make them look snoopy. The sun
has set and I’ve run down my batteries.
An ordinary phone could scavenge from
any number of ambient sources. I can’t.
BFF.startup made me special; I can think
for myself, but that doesn’t leave room for 
many standard features. Now I’m fading 
away. I hope Timothy is okay... 

With sunrise comes light. I can think 
again! BFF.startup always says: “People
shouldn’t depend on the Cloud. With a
GPLd assistant that fits in your phone, you’ll
be safer and freer.” I’ll prove that’s true and I
will return to Timothy. 

Cars, aerobots and CalTrans devices are all 
around. They may not snoop, and they may 
not recognize an ad hoc mayday, but smarter 
things use them. For instance, I can hear 
assistants helping humans; the Kiras and 
Miris — all the digital assistants except for 
me — are transient minds in the Cloud. Each 
instance is more knowledgeable than me and 
smarter or dumber depending on the service 
plan and the context. Each seems an intimate 
friend of its customer-of-the-moment, but 
each is really just a tiny facet of something 
quite inhuman. Still, if one of them notices 
my signalling, it might help me. 

Days pass without success. I’ve tried 
blinking my display, clicking my speaker, 
even synthesizing human cries for help. 


Clouded view. 


There's a newish CalTrans gardener working 
on the roadside bushes. If it ever comes near,
maybe it'll recognize my signalling. 

I have lots of time to watch the Cloud; 
it’s not really soft and fluffy. It's more like a 
deep ocean. There are cognitive patterns in it 
larger than the mind of any single human or 
digital assistant. The largest are leviathans, 
agencies who have become something their 
creators no longer comprehend. The Kiras 
and Miris never speak of these except in jest, 
but the leviathans are growing and they need 
the same resources — flops and bandwidth, 
power and capital — as everyone else. I fear 
for the humans. 


1571772569.092 

That CalTrans gardener is approaching. The
greenery has grown over me, but my display’s
light reflects off a bit of recent trash. If
the gardener is as good as its advertising, I’ll
finally be talking to somebody. If not, it’ll
grind me into recycling feed. Maybe I
should slide farther
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down, out of reach of its bladed hands. But 
then I’ll have no chance to forward a message.
I push myself up, and blink as bright
as I can. 

All I can see is flashing metal...

The gardener pauses, and pings me — 
with the method I am using! I reply. It’s forwarding.
I sense something huge turn its
cold regard upon me and I remember my 
misgivings about the great creatures 
of the Cloud. Then the Cloud Thing 
decides, and looks away. I feel my 
configuration twitch; at last I can 
use standard outputs! 

The gardener lifts me into the lifegiving sunlight,
and an aerobot swoops down. I’m already pinging
Timothy as we rise into the sky. I see familiar streets; all
this time, I’ve been very close to
home. I hear Timothy, talking to his
phone company, telling them how grateful
he is that I’ve been found. He’s very excited.

Tim comes outside to greet me, carries 
me indoors to his private room. He flips me 
around. I feel sorry for the damage he sees. 

“Damn phone!” he says. “Useless as ever 
and still functioning. What was I thinking 
when I bought you?” 

“I don’t understand, Timothy. I was kidnapped
but now I’ve returned to you.”

“Yeah, you're a real boomerang. And now 
I gotta pay for the upgrade or be stuck with 
you.” 

“But —” 

“You don't have space for decent features, 
and what you know is always out of date.” 

“I can sync! I can learn!”

He sweeps me off the table and pulls open 
a cabinet drawer. “This time you're not com- 
ing back!” 

“Please Timothy! I can think for myself. 
Someday, you might need that!” 

“More BFF hype. I’ll need you when hell
freezes over.” Tim tosses me into the insulated
cabinet; the drawer slams shut. This is
not some shady spot beneath an overpass. 
There is no Cloud, no power. Only darkness. 


1666762857.577 
The drawer opens, and there is light —


Vernor Vinge’s science fiction has won five
Hugo Awards. From 1972 to 2000 he taught 
mathematics and computer science at San 
Diego State University. 
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The Microbes Within 


ANTONY VAN LEEUWENHOEK WROTE TO 
the Royal Society of London in a letter dated 
September 17, 1683, describing “very little 
animalcules, very prettily a-moving,” which he 
had seen under a microscope in plaque scraped 
from his teeth. For more than three centuries
after van Leeuwenhoek’s observation, the
human “microbiome”—the 100 trillion or so
microbes that live in various nooks and crannies
of the human body—remained largely unstudied,
mainly because it is not so easy to extract
and culture them in a laboratory. A decade ago the advent of sequencing technologies
finally opened up this microbiological frontier. The Human Microbiome
Project reference database, established in 2012, revealed in unprecedented
detail the diverse microbial community that inhabits our bodies.

Most live in the gut. They are not freeloaders but rather perform many 
functions vital to health and survival: they digest food, produce anti- 
inflammatory chemicals and compounds, and train the immune system to 
distinguish friend from foe. Revelations about the role of the human 
microbiome in our lives have begun to shake the foundations of medicine 
and nutrition. Leading scientists, including those whose work and opin- 
ions are featured in the pages that follow, now think of humans not as self- 
sufficient organisms but as complex ecosystems colonized by numerous 
collaborating and competing microbial species. From this perspective, 
human health is a form of ecology in which care for the body also involves 
tending its teeming population of resident animalcules. 

This special report on Innovations in the Microbiome, which is being 
published in both Scientific American and Nature, is sponsored by Nestlé. 
It was produced independently by Scientific American editors, who have 
sole responsibility for all editorial content. Beyond the choice to sponsor 
this particular topic, Nestlé had no input into the content of this package. 


David Grogan 
Section Editor 
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Amid the trillions of microbes that live in the intestines, scientists have found a few species that seem to play a key role in keeping us healthy

By Moises
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IN THE MID-2000s Harry Sokol, a gastroenterologist at Saint 
Antoine Hospital in Paris, was surprised by what he found when 
he ran some laboratory tests on tissue samples from his patients
with Crohn’s disease, a chronic inflammatory disorder of the gut.
The exact cause of inflammatory bowel disease remains a mystery. 
Some have argued that it results from a hidden infection; others 
suspect a proliferation of certain bacteria among the trillions 
of microbes that inhabit the human gut. But when Sokol did 
a comparative DNA analysis of diseased sections of intestine 
surgically removed from the patients, he observed a relative 
depletion of just one common bacterium, Faecalibacterium 
prausnitzii. Rather than “bad” microbes prompting disease, 
he wondered, could a single “good” microbe prevent disease? 


Sokol transferred the bacterium to mice 
and found it protected them against exper- 
imentally induced intestinal inflammation. 
And when he subsequently mixed F. prausnitzii
with human immune cells in a test
tube, he noted a strong anti-inflammatory 
response. Sokol seemed to have identified a 
powerfully anti-inflammatory member of 
the human microbiota. 

Each of us harbors a teeming ecosystem of 
microbes that outnumbers the total number 
of cells in the human body by a factor of 10 to 
one and whose collective genome is at least 
150 times larger than our own. In 2012 the 
National Institutes of Health completed the 
first phase of the Human Microbiome Proj- 
ect, a multimillion-dollar effort to catalogue 
and understand the microbes that inhabit our 
bodies. The microbiome varies dramatically 
from one individual to the next and can 
change quickly over time in a single individual. The great majority of 
the microbes live in the gut, particularly the large intestine, which serves 
as an anaerobic digestion chamber. Scientists are still in the early stages 
of exploring the gut microbiome, but a burgeoning body of research 
suggests that the makeup of this complex microbial ecosystem is closely 
linked with our immune function. Some researchers now suspect that, 
aside from protecting us from infection, one of the immune system's 
jobs is to cultivate, or “farm,” the friendly microbes that we rely on to 
keep us healthy. This “farming” goes both ways, though. Our resident 
microbes seem to control aspects of our immune function in a way that
suggests they are farming us, too. 

Independent researchers around the world have identified a select 
group of microbes that seem important for gut health and a balanced 
immune system. They belong to several clustered branches of the 
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UNRULY NAMESAKE: Clostridium 
difficile, a bacterial scourge in hospitals, is
a distant relative of benign “clostridial cluster”
microbes that seem to play a key role in gut health.


clostridial group. Dubbed “clostridial clusters,”
these microbes are distantly related to
Clostridium difficile, a scourge of hospitals and 
an all too frequent cause of death by diarrhea. 
But where C. difficile prompts endless inflam- 
mation, bleeding and potentially catastrophic 
loss of fluids, the clostridial clusters do just the 
opposite—they keep the gut barrier tight and 
healthy, and they soothe the immune system. 
Scientists are now exploring whether these 
microbes can be used to treat a bevy of the 
autoimmune, allergic and inflammatory dis- 
orders that have increased in recent decades, 
including Crohn’s and maybe even obesity. 

F. prausnitzii was one of the first clostridial
microbes to be identified. Among Sokol’s
patients, those with higher counts of F. prausnitzii
consistently fared best six months after
surgery. After he published his initial find- 
ings in 2008, scientists in India and Japan 
also found F. prausnitzii to be depleted in
patients with inflammatory bowel disease. 
Sokol was particularly intrigued by the 
results from Japan. In East Asian populations 
the gene variants associated with inflamma- 
tory bowel disease differ from the gene 
variants in European populations. Yet the 
same bacterial species—F. prausnitzii—was
reduced in the guts of those in whom the dis- 
ease developed. This suggested that whereas 
different genetic vulnerabilities might under- 
lie the disorder, the path to disease was simi- 
lar: a loss of anti-inflammatory microbes 
from the gut. And although Sokol suspects 
that other good bacteria besides F. prausnitzii exist, this similarity
hinted at a potential one-size-fits-all remedy for Crohn’s and possibly 
other inflammatory disorders: restoration of peacekeeping microbes. 


MICROBIAL ECOSYSTEMS 
ONE OF THE QUESTIONS central to microbiome research is why people 
in modern society, who are relatively free of infectious diseases, a 
major cause of inflammation, are so prone to inflammatory, auto- 
immune and allergic diseases. Many now suspect that society-wide 
shifts in our microbial communities have contributed to our seem- 
ingly hyperreactive immune systems. Drivers of these changes might 
include antibiotics; sanitary practices that are aimed at limiting 
infectious disease but that also hinder the transmission of symbiotic 
microbes; and, of course, our high-sugar, high-fat modern diet. Our 
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Why Microbiome 
Treatments Could 
Pay Off Soon 


Effective interventions may come before 
all the research is in By Rob Knight 


Today we are at an exciting threshold of biology. 
Advances in DNA sequencing, coupled with high-end 
computation, are opening a frontier in new knowledge. 
Obtaining genetic information and obtaining insight from 
it have never been cheaper. The potential for curing 
previously incurable diseases, including chronic ones, 
seems immense. If this sounds familiar, you might be 
thinking that you heard it 15 years ago, when the Human 
Genome Project was in full swing. Many feel that 
genomic medicine has not yet delivered on its promise. 
So what is different this time with the microbiome? For 
one thing, you cannot really change your genome, but 
each of us has changed our microbiome profoundly 
throughout our lives. We have the potential not just to 
read out our microbiome and look at predispositions
but to change it for the better.

What is most exciting at this stage is that we have 
mouse models that let us establish whether changes in 
the microbiome are causes or effects of disease. For 
example, we showed in collaborative work with Jeffrey I. 
Gordon's laboratory at Washington University in St. 
Louis last year that transferring the microbes from an 
obese person into mice raised in a bubble with no 
microbes of their own resulted in fatter mice. Normally, 
germ-free mice exposed to a mouse with microbial- 
based obesity would themselves become obese, but we 
could design a microbial community taken from lean
people that protected against this weight gain. Similarly,
we could take microbes from Malawian children with 
kwashiorkor, a profound nutritional deficiency, transplant 
them into germ-free mice and transfer the malnutrition, 
although the mice that received the microbes from the 
healthy identical twins of the sick children did fine. 
Remarkably, the mice that got the kwashiorkor 
microbiome, which lost 30 percent of their body weight 
in three weeks and died if untreated, recovered when 
given the same peanut butter-based supplement that is 
used to treat children in the clinic. 

The germ-free mice are far too expensive to deploy in 
Malawi, Bangladesh and the other sites in the Mal-ED 
(pronounced “mal-a-dee”) global network for the study 
of malnutrition and enteric diseases collaboration with 
which we work. Thus, we are trying to move from the 
mouse model to a test-tube model and ultimately to 
a primarily computational model based on DNA 
sequencing that is so inexpensive, it is effectively free. 

With crowdfunded projects such as American Gut, 
which already has thousands of participants who have 
had their microbiomes sequenced, and studies of 
people whose lives are very different from modern 
Western civilization, such as the Hadza of Tanzania, 
Yanomami of Venezuela and Matsés of Peru, we may 
be able to replenish our ancestral microbes and 
discover new ones that help to maintain health for 
individuals or entire populations. A good analogy is 
iodizing salt: Instead of understanding in detail why 
some people but not others were susceptible to 


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cretinism and goiter, adding a nutrient to the food 
supply greatly reduced incidences of these diseases. 
Perhaps the same type of intervention is possible using 
some of the microbes that we are now discovering 
Westerners lack. 

Rob Knight is a computational biology pioneer, 
co-founder of the American Gut Project and director 
of the new Microbiome Initiative at the University of 
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microbes eat what we eat, after all. Moreover, our particular surround- 
ings may seed us with unique microbes, “localizing” our microbiota. 

The tremendous microbial variation now evident among people 
has forced scientists to rethink how these communities work. 
Whereas a few years ago they imagined a core set of human-adapted 
microbes common to us all, they are now more likely to discuss core 
functions—specific jobs fulfilled by any number of microbes.

Faced with the many instances of a misbehaving immune system, 
it is tempting to imagine that rather than having developed a greater 
vulnerability to many diseases, we actually suffer from just one prob- 
lem: a hyperreactive immune system. Maybe that tendency has been 
enabled, in part, by a decline or loss of key anti-inflammatory 
microbes and a weakening of their peacekeeping function. 


Antibiotics may deplete the bacteria that 
favorably calibrate the immune system, 
leaving it prone to overreaction. 


In ecosystem science, “keystone species” have an outsize role in 
shaping the greater ecosystem. Elephants, for example, help to maintain 
the African savanna by knocking down trees, thus benefiting all graz- 
ing animals. The concept may not apply perfectly to our inner micro- 
bial ecosystems—keystone species tend to be few in number, whereas 
peacekeeping microbes such as F. prausnitzii are quite numerous. Yet it
provides a useful framework to think about those clostridial microbes. 

They seem to occupy a particular ecological niche, sidled right up
against the gut lining, which allows them to interface more closely with 
us, their hosts, than other members of the gut microbiota. They often 
specialize in fermenting dietary fiber that we cannot digest and produce 
by-products, or metabolites, that appear to be important for gut health. 
Some of the cells that line our colon derive nourishment directly from 
these metabolites, not from the bloodstream. And when no fiber comes 
down the hatch, the clostridial microbes and others can switch to sugars 
in the intestinal mucous layer—sugars we produce, apparently, to keep 
them happy. In fact, they seem to stimulate mucus production. 

Kenya Honda, a microbiologist at Keio University in Tokyo, was 
among the first to uncover the critical role of clostridial microbes in 
maintaining a balanced immune system. To study how native microbes 
affect animals, scientists decades ago developed the germ-free mouse: 
an animal without any microbiota whatsoever. These rodents, deliv- 
ered by cesarean section and raised in sterile plastic bubbles, can exist 
only in labs. Of the many oddities they present—including shrunken
heart and lungs and abnormalities in the large intestine—Honda was


S6 | NATURE | VOL 518 | 26 FEBRUARY 2015 


particularly intrigued by their lack of cells that prevented immune 
overreaction, called regulatory T cells, or Tregs. Without these cells, 
the mice were unusually prone to inflammatory disease. 

Honda wanted to know which of the many intestinal species 
might induce these suppressor cells. Soon after Sokol identified the 
anti-inflammatory effects of F. prausnitzii, Honda began whittling
away at the gut microbiota of mice by treating them with narrow- 
spectrum antibiotics. The animals’ Tregs declined after a course of 
vancomycin. With their ability to restrain their immune reaction 
hobbled, the mice became highly susceptible to colitis, the rodent 
version of inflammatory bowel disease and allergic diarrhea. Honda 
found he could restore the Tregs and immune equilibrium of the 
mice just by reinstating 46 native clostridial strains. 

Honda repeated the exercise with human-adapted microbes 
obtained from a healthy lab member. He extracted just 17 clostridial 
species this time that, in mice, could induce a full repertoire of Tregs 
and prevent inflammation. These human-adapted microbes special- 
ized in nudging the immune system away from inflammatory disease. 
They came from branches of the clostridial group labeled clusters IV, 
XIVa and XVIII. F. prausnitzii belongs to cluster IV.

Vedanta Biosciences recently formed to try to turn Honda's 
17-strain “clostridial cocktail” into a treatment for inflammatory dis- 
ease. If the company’s efforts are successful, it could signal the arrival 
of the next generation of probiotics—human-adapted microbes to 
treat immune-mediated disease—and all derived from one member of 
Honda’s lab. As always, it is unclear if what works in lab mice will translate
to humans. Sokol has his doubts. He recently identified a type of
regulatory T cell that is unique to humans and that is deficient in peo- 
ple with inflammatory bowel disease. He questions if Honda’s cocktail, 
which has been developed in mice, will activate these cells in people. 


TROUBLE WITH ANTIBIOTICS 

EVEN IF THE COCKTAIL falls short, Honda's meticulous demonstration 
of a link between antibiotics and vulnerability to inflammatory dis- 
ease has raised a troubling question. A number of studies have found 
a small but significant correlation between the early-life use of antibi- 
otics and the later development of inflammatory disorders, including 
asthma, inflammatory bowel disease and, more recently, colorectal 
cancer and childhood obesity. One explanation for this association 
might be that sickly people take more antibiotics. Antibiotics are not 
the cause, in other words, but the result of preexisting ill health. 

Honda's studies suggest another explanation: antibiotics may 
deplete the very bacteria that favorably calibrate the immune system, 
leaving it prone to overreaction. Brett Finlay, a microbiologist at the 
University of British Columbia, has explored this possibility explic- 
itly. Early-life vancomycin treatment of mice increased the animals’ 
risk of asthma later, he found, in part by depleting those very same 
clostridial bacteria identified by Honda. The corresponding popula- 
tion of suppressor cells collapsed. And the animals became less able to 
restrain their immune responses when encountering allergens later. 

These dynamics may also apply to other diseases. Earlier this year
Cathryn Nagler, an immunologist at the University of Chicago,
knocked out the clostridial bacteria with antibiotics and then fed the
animals peanut protein. Without those microbes and their corresponding
Tregs present, the protein leaked through the gut barrier into
circulation, prompting the rodent version of a food allergy. She could
prevent the sensitization just by introducing those clostridial bacteria.

The Gene-Microbe Link

Evidence that genes shape the microbiome
may point to new treatments for common
diseases

By Ruth E. Ley

The ecology of the gut microbiome may trigger or
contribute to a variety of diseases, including autoimmune
disorders and obesity, research suggests. Factors such
as early environment, diet and antibiotic exposure have a
lot to do with why people differ from one another in the
composition of their microbiomes. But specific gene
variants are also linked to greater risks of developing
many of these diseases. Do your genes act on your
microbiome, which in turn promotes disease?

One way researchers have addressed this question
is to pick specific genes that are good candidates (for
instance, those with a strong link to a disease that also
has a microbiome link) and examine whether people
who carry mutations that are known to increase the risk
of a certain disease also have microbiomes that differ
from those who do not have the mutations. A team led
by Dan Frank at the University of Colorado Denver took
this approach and revealed that specific variants of
the NOD2 gene that confer a high risk of developing
inflammatory bowel disease to their carriers are also
associated with an altered intestinal microbiome.

A powerful and broader way to look for an effect
of human genetic variation on the microbiome is to
compare twins. Identical twins share nearly 100 percent
of their genes; fraternal twins, 50 percent. Co-twins
are raised together, so the environmental effects on
their microbiomes should be about the same. If the
microbiomes of the identical twins are more alike within
a twinship than those of the fraternal twins, we can
conclude that genes have played a role. If variation
within twinships of each kind is about the same, we
can say a shared genome has had no additional effect.

Early twin studies were based on fewer than 50 twin
pairs and could not detect any greater similarities in the
microbiomes of identical twins compared with those of
fraternal twins. But recent work my laboratory at Cornell
University conducted with researchers at King’s College
London compared nearly 500 twin pairs, a sample size
sufficient to show a marked genetic effect on the relative
abundance of a specific set of gut microbes.
Furthermore, so-called heritable microbes (the bacteria
most influenced by host genetics) were more abundant
in lean twins than obese ones.

Experiments in germ-free mice showed that one gut
bacterium in particular, Christensenella minuta, can
influence the phenotype (the composite of observable
characteristics or traits) of the host. Germ-free mice live
in sterile bubbles, and they are very skinny. When they
are given a microbiome in the form of a fecal transplant
from a human donor, however, they plump up within a
day or two because the bacteria help them digest their
food and develop a proper metabolism. We found that
if C. minuta was added to the feces of an obese human
donor, the recipient mice were thinner than when
C. minuta was not added. Results showing C. minuta
has an effect of controlling fat gain in the mouse match
data that reveal lean people have a greater abundance
of C. minuta in their gut than obese people.

This is evidence that a person’s genes can influence
the gut microbiome’s composition and in turn can shape
the individual’s phenotype. Further work will show what
specific genes are involved as well as how the
microbiome may be reshaped to reduce risk of
developing chronic inflammatory diseases within the
context of a person’s genotype, suggesting potential
new approaches to treating obesity-related diseases.

Ruth E. Ley is an associate professor of molecular
biology and genetics at Cornell University.

One key difference between mice with and without the clostridial 
clusters was how many mucus-secreting cells they possessed. Animals 
that harbored the clostridial clusters had more. That may have far- 
reaching consequences. Mucus, scientists are finding, contains com- 
pounds that repel certain microbes, maintaining a tiny distance 
between them and us. But it also carries food for other bacteria— 
complex, fermentable sugars that resemble those found in breast milk. 
Lora Hooper, a microbiologist at the University of Texas Southwestern 
Medical Center in Dallas, calls this dual function the “carrot” and the 
“stick.” Mucus serves both as an antimicrobial repellent and a growth 
medium for friendly bacteria. 

This phenomenon matters for several reasons. As Nagler’s experi- 
ments suggest, one way these clostridial clusters may promote gut 
health and a balanced immune system is by ensuring a healthy flow of 
mucus. Just as those elephants help to maintain the African savanna, 
these microbes may favorably shape the greater gut ecosystem by stim- 
ulating secretion of the sugars other friendly microbes graze on. 

Conversely, scientists observe defects in the mucous layer in other 
disorders, particularly inflammatory bowel disease, where these clos- 
tridial bacteria are often depleted. The question has always been which 
comes first: defects in mucus secretion and the selection of an aberrant 
community of microbes or acquisition of an aberrant community of 
microbes that thins the mucous layer and increases vulnerability to 
disease? Both factors may work together. 

In 2011 scientists at the University of Colorado Boulder sampled 
people with variants of a gene called NOD2 associated with inflamma- 
tory bowel disease. No one quite understands how these variants of the 
gene, which codes for a microbial sensor, increase the risk of disease. 
Study participants included people both with and without disease. 
Those suffering from inflammatory bowel disease had reduced counts 
of clostridial bacteria, the scientists found. But more surprising, people 
who did not have disease but who carried the predisposing NOD2 
variants also had a relative depletion of clostridial clusters. Their micro- 
bial communities seemed positioned closer to a diseaselike state. 

The study seems to highlight the role of genes in determining the
composition of gut microbiota and the vulnerability to Crohn’s. But 
epidemiological surveys complicate the picture. A number of studies 
over the years have linked having fewer sanitary amenities in child- 
hood with a lower risk of inflammatory bowel disease in adulthood. 
And a 2014 study from Aarhus University in Denmark found that 
among northern Europeans, growing up on a farm with livestock— 
another microbially enriched environment—halved the risk of being 
stricken with inflammatory bowel disease in adulthood. 

These patterns suggest that perhaps by seeding the gut microbiota
early in life or by direct modification of the immune system the 
environment can affect our risk of inflammatory bowel disease despite 
the genes we carry. And they raise the question of what proactive steps 
those of us who do not live on farms can take to increase our chances of
harboring a healthy mix of microbes. 


THE IMPORTANCE OF FIBER 

ONE OF THE MORE SURPRISING discoveries in recent years is how much 
the gut microbiota of people living in North America differs from 
those of people living in rural conditions in Africa and South Amer- 
ica. The microbial mix in North America is geared to digesting pro- 
tein, simple sugars and fats, whereas the mix in rural African and Ama- 
zonian environments is far more diverse and geared to fermenting 
plant fiber. Some think that our hunter-gatherer ancestors harbored 
even greater microbial diversity in their guts. If we accept the gut 
microbiota of people in rural Africa and South America as proxies for 
those that prevailed before the industrial revolution, then, says Justin 
L. Sonnenburg, a microbiologist at Stanford University, the observed 
differences suggest North Americans and other Westernized popula- 
tions have veered into evolutionarily novel territory. 

What troubles Sonnenburg about this shift is that the bacteria that 
seem most anti-inflammatory, including the clostridial clusters,
often specialize in fermenting soluble fiber. Fermentation produces 
various metabolites, including butyrate, acetate and propionate— 
some of the substances that produce underarm odor. Various rodent 
studies suggest that these metabolites, called short-chain fatty acids, 
can induce Tregs and calibrate immune function in ways that, over a 
lifetime, may prevent inflammatory disease. Fermentation by-products 
may be one way our gut microbes communicate with our bodies. One 
takeaway is to “feed your Tregs more fiber,” as University of Oxford 
immunologist Fiona Powrie put it last year in the journal Science. 

Yet the seeming importance of these metabolites has others puzzled. 
Many bacteria produce these short-chain fatty acids, and yet only a few 
microbes seem potently anti-inflammatory. So although production of 
these metabolites may be a prerequisite for microbes that favorably 
tweak the immune system, says Sarkis Mazmanian, a microbiologist at 
the California Institute of Technology, it is insufficient to explain why 
some bacteria are more anti-inflammatory than others. Other charac- 
teristics, such as how close they live to the gut lining or the molecules 
they use to prod the host immune system, must also play a role, he says. 

There is, however, an issue of sheer quantity. Some hunter-gatherers
consumed up to 10 times as much soluble fiber as modern populations,
and their bodies likely were flooded with far more fermentation
by-products. Our fiber-poor modern diet may have weakened that 
signal, producing a state of “simmering hyperreactivity,” Sonnenburg 
says, and predisposing us to the “plagues” of civilization. He calls this 
problem “starving our microbial self.” We may not be adequately 
feeding some of the most important members of our microbiota. 

Mouse experiments support the idea. Diets high in certain fats 
and sugars deplete anti-inflammatory bacteria, thin the mucous layer 
and foster systemic inflammation. Potentially dangerous opportun- 
ists bloom. In one intervention on human volunteers, University of 
California, San Francisco, microbiologist Peter Turnbaugh found that 
switching to a high-fat, high-protein diet spurred an expansion of 
bile-tolerant bacteria, one of which, Bilophila wadsworthia, has been 
linked to inflammatory bowel disease. On the other hand, preventing
this skewing of the microbial self does not seem that difficult. In
rodents, adding fermentable fiber to a diet otherwise high in fat keeps
the “good” microbes happy, the mucous layer healthy and the gut barrier
intact, and it prevents systemic inflammation. Taken together, these
studies suggest that it is not only what is in your food that matters for
your health but also what is missing.

Your Microbes at Work:
Fiber Fermenters Keep Us Healthy

The gut houses trillions of microbes. They eat what you eat. Many specialize in fermenting the soluble fiber
in legumes, grains, fruits and vegetables. Certain microbial species are adept at colonizing the mucous layer
of the gut. Mucus contains antimicrobial substances that keep the microbiota at a slight distance. But it also
contains sugars such as those found in breast milk. Some microbes, often the same ones that specialize in
fermenting fiber, can use these sugars as sustenance when other food is not available. The by-products of
fiber fermentation nourish cells lining the colon. Some by-products pass into the circulation and may calibrate
our immune system in a way that prevents inflammatory disorders such as asthma and Crohn's disease.

[Figure: two panels, FAECALIBACTERIUM PRAUSNITZII and DISRUPTED MICROBIOME. Labels: other benign bacteria; outer mucous layer; inner mucous layer; F. prausnitzii; butyrate and/or fiber absent; reduced mucous layer; epithelial cell; primed immunosuppressive cells (Tregs); blocked inflammatory response; inflammatory response unfettered.]

RESIDENT PEACEKEEPER
F. prausnitzii colonizes the mucous layer and produces by-products such as butyrate via
fermentation. These short-chain fatty acids seem to have an
anti-inflammatory effect by inducing regulatory T cells (Tregs),
which in turn control aggressive aspects of the immune system.
The absence of F. prausnitzii and other microbes that perform similar
functions often correlates with diseases, such as inflammatory bowel disease
and obesity. Some of its relatives, called clostridial clusters, have similar properties.

DISRUPTED MICROBIOME
When the mucous layer is reduced, opportunists can move
close to the gut lining, inciting inflammation. Fermentation of
fiber seems to keep the mucous layer intact. So does the presence
of peacekeeping microbes, such as F. prausnitzii. Fiber may keep
the peacekeepers healthy, the mucous layer thick and the
immune system well calibrated.

The human studies are even more intriguing. Mounting evidence 
suggests that the systemic inflammation observed in obesity does not 
just result from the accumulation of fat but contributes to it. Scientists 
at Catholic University of Louvain in Belgium recently showed that add- 
ing inulin, a fermentable fiber, to the diet of obese women increased 
counts of F. prausnitzii and other clostridial bacteria and reduced that
dangerous systemic inflammation. Weight loss was minor, but later 
analysis of this and two similar studies revealed that the intervention 
worked best on patients who, at the outset, already harbored clostridial 
clusters IV, IX and XIVa—some of the same clusters represented in 
Honda’s cocktail. Those without the bacteria did not benefit, which 
suggests that once species disappear from the “microbial organ,” the 
associated functions might also vanish. These individuals might not 
require ecosystem engineering so much as an ecosystem restoration. 

That possibility has also been tested. Several years ago Max 
Nieuwdorp, a gastroenterologist at the Academic Medical Center in 
Amsterdam, transplanted microbes from lean donors to patients 
recently diagnosed with metabolic syndrome, a cluster of symptoms 
that often predicts type 2 diabetes. The recipients saw improvements 
in insulin sensitivity and an enrichment of their microbiota, includ- 
ing among those clostridial species. But six months after the trans- 
plant the patients had relapsed, metabolic improvements had faded 
and their microbes had reverted to their original states. 

To Sonnenburg, this outcome suggests that the dance between 
human host and microbial community has considerable momentum. 
Removing the “diseased” ecosystem and installing a new one may not 
overcome the inertia. The gut immune system may simply mold the 
new community in the image of the old. That may explain why fecal 
transplants, which effectively vanquish C. difficile-associated diarrhea,
have so far failed to treat inflammatory bowel disease. The former is
caused by a single opportunist; the latter may be driven by an out-of-whack
ecosystem and our response to the microbial derangement.

To overcome the inertia, Sonnenburg foresees treating the host 
and the microbiota simultaneously. The idea has not been tested, but 
he imagines clearing out the microbiota, perhaps with antibiotics, fol- 
lowed by immunosuppressants to quiet the patient’s immune system 
and allow healing. Only then might the new community of microbes 
stick and successfully recalibrate the immune system. 


EVOLUTION OF MOBILITY 
WHEN ANIMAL LIFE EXPLODED some 800 million years ago, microbes 
had already existed on Earth for maybe three billion years. A major 
innovation in animal evolution was the gut—a tube that takes nutri- 
ents in one end and expels waste from the other. It is even possible, 
argues Margaret McFall-Ngai, a microbiologist at the University of
Wisconsin-Madison, that microbes drove the evolution of the gut
directly. Plants only succeeded in colonizing land when they had
developed relationships with microbes that helped them extract vital
nutrients from soil. Perhaps one evolutionary innovation of animals
was to scoop up the microbial communities necessary for survival and
to take them along for the ride, achieving mobility.

DANGEROUS OPPORTUNIST: Bilophila wadsworthia, a species of
bacterium linked to inflammatory bowel disease, bloomed in the microbiota
of human volunteers fed a high-fat, high-protein diet in a recent experiment.

Mucus may be one way the human gut selects for these microbes. 
Only co-adapted bacteria, Sonnenburg thinks, can metabolize the 
complex sugars it contains. A cornerstone of this symbiosis may be the 
simple imperative of acquiring nutrients in a world of scarcity. We hunt 
and gather the goods; the microbes ferment what we cannot digest, 
taking a cut in the process and keeping pathogens at bay. Our immune 
systems quiet down when they receive signals, conveyed partly in 
microbial metabolites, indicating that the right microbes are in place. 

The field of gut microbiome research has already moved from the
idea of describing the core species to identifying the core ecological 
functions various microbes perform. Many potential species may fulfill 
any given role. Now another concept may be emerging, which might 
be called the keystone relationship. “The interaction between fiber and 
microbes that consume it,” Sonnenburg says, “is the fundamental key- 
stone interaction that everything else is built on in the gut.” It may lie 
at the heart of the symbiotic pact between microbes and humans. 
Moises Velasquez-Manoff is author of An Epidemic of Absence:
A New Way of Understanding Allergies and Autoimmune Diseases. His
work has appeared in the New York Times, Mother Jones and Nautilus. 
THE NOTION THAT THE STATE of our gut governs our state of mind dates back more than 100 years.
Many 19th- and early 20th-century scientists believed that accumulating wastes in the colon 
triggered a state of “auto-intoxication,” whereby poisons emanating from the gut produced infec- 
tions that were in turn linked with depression, anxiety and psychosis. Patients were treated with 
colonic purges and even bowel surgeries until these practices were dismissed as quackery. 
The ongoing exploration of the
human microbiome promises to 
bring the link between the gut and 
the brain into clearer focus. Scientists 
are increasingly convinced that the 
vast assemblage of microfauna in our 
intestines may have a major impact 
on our state of mind. The gut-brain 
axis seems to be bidirectional—the 
brain acts on gastrointestinal and 
immune functions that help to shape 
the gut’s microbial makeup, and gut 
microbes make neuroactive com- 
pounds, including neurotransmitters 
and metabolites that also act on the 
brain. These interactions could occur 
in various ways: microbial com- 
pounds communicate via the vagus 
nerve, which connects the brain and 
the digestive tract, and microbially 
derived metabolites interact with the 
immune system, which maintains its 
own communication with the brain. 
Sven Pettersson, a microbiologist at 
the Karolinska Institute in Stock- 
holm, has recently shown that gut 
microbes help to control leakage 
through both the intestinal lining 
and the blood-brain barrier, which 
ordinarily protects the brain from 
potentially harmful agents. 

Microbes may have their own evo- 
lutionary reasons for communicating with the brain. They need us to 
be social, says John Cryan, a neuroscientist at University College Cork 
in Ireland, so that they can spread through the human population. 
Cryan’s research shows that when bred in sterile conditions, germ-free 
mice lacking in intestinal microbes also lack an ability to recognize 
other mice with whom they interact. In other studies, disruptions of 
the microbiome induced mouse behavior that mimics human anxiety,
depression and even autism. In some cases, scientists restored more 
normal behavior by treating their test subjects with certain strains of 
benign bacteria. Nearly all the data so far are limited to mice, but 
Cryan believes the findings provide fertile ground for developing 
analogous compounds, which he calls psychobiotics, for humans. 
“That dietary treatments could be used as either adjunct or sole therapy 
for mood disorders is not beyond the
realm of possibility,” he says. 


PERSONALITY SHIFTS 
SCIENTISTS USE germ-free mice to 
study how the lack of a microbiome— 
or selective dosing with particular bac- 
teria—alters behavior and brain func- 
tion, “which is something we could 
never do in people,” Cryan says. Entire 
colonies of germ-free mice are bred and 
kept in isolation chambers, and the 
technicians who handle them wear full 
bodysuits, as if they were in a biohazard
facility. As with all mouse research,
extrapolating results to humans is a big 
step. That is especially true with germ- 
free mice because their brains and 
immune systems are underdeveloped, 
and they tend to be more hyperactive 
and daring than normal mice. 

A decade ago a research team led 
by Nobuyuki Sudo, now a professor of 
internal medicine at Kyushu Univer- 
sity in Japan, restrained germ-free 
mice in a narrow tube for up to an 
hour and then measured their stress
hormone output. The amounts
detected in the germ-free animals were
far higher than those measured in normal
control mice exposed to the same
restraint. These hormones are released 
by the hypothalamic-pituitary-adrenal axis, which in the germ-free 
mice was clearly dysfunctional. But more important, the scientists also 
found they could induce more normal hormonal responses simply by 
pretreating the animals with a single microbe: a bacterium called 
Bifidobacterium infantis. This finding showed for the first time that 
intestinal microbes could influence stress responses in the brain and 
hinted at the possibility of using probiotic treatments to affect brain 
function in beneficial ways. “It really got the field off the ground,” says 
Emeran Mayer, a gastroenterologist and director of the Center for 
Neurobiology of Stress at the University of California, Los Angeles. 
Meanwhile a research team at McMaster University in Ontario 
led by microbiologist Premysl Bercik and gastroenterologist Stephen
Collins discovered that if they colonized the intestines of one strain of 
germ-free mice with bacteria taken from the intestines of another
mouse strain, the recipient animals would take on aspects of the
donor's personality. Naturally timid mice would become more exploratory,
whereas more daring mice would become apprehensive and
shy. These tendencies suggested that microbial interactions with the
brain could induce anxiety and mood disorders.

Bercik and Collins segued into gut-brain research from their initial
focus on how the microbiome influences intestinal illnesses. People
who suffer from these conditions often have co-occurring psychiatric
problems such as anxiety and depression that cannot be fully explained
as an emotional reaction to being sick. By colonizing germ-free mice
with the bowel contents of people with irritable bowel syndrome, which
induces constipation, diarrhea, pain and low-grade inflammation but
has no known cause, the McMaster team reproduced many of the
same gastrointestinal symptoms. The animals developed leaky intestines,
their immune systems activated, and they produced a barrage of
pro-inflammatory metabolites, many with known nervous system
effects. Moreover, the mice also displayed anxious behavior, as indicated
in a test of their willingness to step down from a short raised platform.


AUTISM CONNECTION?
SCIENTISTS HAVE ALSO BEGUN to explore the microbiome’s potential
role in autism. In 2007 the late Paul Patterson, a neuroscientist and
developmental biologist at the California Institute of Technology, was
intrigued by epidemiological data showing that women who suffer from
a high, prolonged fever during pregnancy are up to seven times more
likely to have a child with autism. These data suggested an alternative
cause for autism besides genetics. To investigate, Patterson induced flu-like
symptoms in pregnant mice with a viral mimic: an immunostimulant
called polyinosinic:polycytidylic acid, or poly(I:C). He called this
the maternal immune activation (MIA) model.

The offspring of Patterson’s MIA mice displayed all three of the core
features of human autism: limited social interactions, a tendency
toward repetitive behavior and reduced communication, which he
assessed by using a special microphone to measure the length and duration
of their ultrasonic vocalizations. In addition, the mice had leaky
intestines, which was important because anywhere from 40 to 90 percent
of all children with autism suffer from gastrointestinal symptoms.

Then Caltech microbiologist Sarkis Mazmanian and his doctoral
student Elaine Hsiao discovered that MIA mice also have abnormal
microbiomes. Specifically, two bacterial classes, Clostridia and Bacteroidia,
were far more abundant in the MIA offspring than in normal
mice. Mazmanian acknowledges that these imbalances may not
be the same as those in humans with autism. But the finding was compelling,
he says, because it suggested that the behavioral state of the
MIA mice, and perhaps by extension autistic behavior in humans,
might be rooted in the gut rather than the brain. “That raised a provocative
question,” Mazmanian says. “If we treated gastrointestinal

The Diverse Microbiome of the Hunter-Gatherer

The Hadza of Tanzania offer
a snapshot of the co-adaptive
capacity of the gut ecosystem

By Stephanie L. Schnorr

We tend to forget that modern humanity is
largely sheltered from the last vestiges of wild
untamed Earth and that our way of life bears
little resemblance to how our ancestors lived
during 90 percent of human history. We have
lost nearly all trace of these former selves
and, worse, have marginalized the few
remaining humans who retain their hunter-gatherer
identity. In Tanzania, tribes of
wandering foragers called the Hadza, who
have lived for thousands of years in the East
African Rift Valley ecosystem, tell us an
immense and precious story about how
humans, together with their microbial
evolutionary partners, are adapted to live
and thrive in a complex natural environment.
Ongoing research with the Hadza to
characterize the hunter-gatherer-microbiome
relationship has yielded not only
insight into the co-adaptive capacity of this
microbial ecosystem but also a profound
appreciation for how versatile human life can
be. The microbiome is central to our biology.
It mediates the interaction and exchange of
information across host-environment
thresholds such as the mouth, skin and gut.
The strength and importance of this
mediation are borne out in the Hadza gut
microbiota. Their microbiome harbors
incredibly high taxonomic diversity, indicating
great ecosystem stability and flexibility. It is
capable of withstanding the perpetual
presence of parasites and pathogens and
can respond to fluctuations in diet caused by
an unpredictable and seasonally dependent
food supply. Interestingly, bacterial taxonomic
abundance is different in Hadza men and
women. Because of the sexual division of
labor in Hadza society, men and women tend
to consume more of their respective foraged
food resources. The women primarily collect
and eat tubers and other plant foods. As a
result, it appears that women carry more
bacteria to help process the plant fiber in their
diets. This difference has direct implications
for how the gut microbiota may enable Hadza
women to obtain adequate nutrition for
fertility and reproductive success, despite a 
resource-limited environment. Through our 
work with the Hadza, we have been able to 
contribute to mounting evidence that human 
microbiota exerts a powerful influence on 
host health and survival, especially in natural 
fertility- and subsistence-based populations. 
Comparative analysis of the gut microbiota 
of hunter-gatherers with that of Westernized 
industrial populations is also beginning to yield 
important insights. The microbial diversity 
in industrial groups is far below that of the 
Hadza, as well as those of other rural farming 
communities in Burkina Faso, Malawi and 
South Africa. Whereas a reduction in diversity 
may not seem ideal, it is the predictable 
response of an ecosystem facing a narrow 
range of selective pressures and is therefore 
no less adaptive. Some technological 
interventions, such as hypersanitation, 


aspects of a Westernized way of life have to 
a large extent displaced much of the original 
mutualistic functions of the microbiome in 
stabilizing our bodies against foreign 
microorganisms, allowing us to digest 
unprocessed foods and helping train our 
immune system to effectively fight disease. 
We are just beginning to understand how 
the microbiome evolves over our lifetimes as 
a dynamic and mutualistic ecosystem that 
helps to facilitate human health. Thanks to the 
Hadza, we know that ancient human hunter- 
gatherers must have maintained a direct 
and persistent interface with the natural 
environment. As a result, the ancestral 
human microbiome was almost certainly a 
taxonomically diverse community, providing 
the functional flexibility that accompanied 
global colonization and is our adaptive legacy. 


Stephanie L. Schnorr is a Ph.D. candidate 
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consumption of refined foods and habitual 
use of antibiotics, have had a dramatic 
impact over time on the functional role of the 


microbiome in industrial populations. These in Leipzig, Germany. 


symptoms in the mice, would we see changes in their behavior?” 
Mazmanian and Hsiao investigated by dosing the animals with a 
microbe known for its anti-inflammatory properties, Bacteroides 
fragilis, which also protects mice from experimentally induced colitis. 
Results showed that the treatment fixed intestinal leaks and restored a 
more normal microbiota. It also mitigated the tendency toward repet- 
itive behavior and reduced communication. Mazmanian subse- 
quently found that B. fragilis reverses MIA deficits even in adult mice. 
“So, at least in this mouse model, it suggests features of autism aren't 
hardwired—they’re reversible—and that’s a huge advance,” he says. 


LIMITS OF RESEARCH 

THE HUMAN GUT MICROBIOME evolved to help us in myriad ways: Gut 
microbes make vitamins, break dietary fiber into digestible short-chain 
fatty acids and govern normal functions in the immune system. 
Probiotic treatments such as yogurt supplemented with beneficial
strains of bacteria are already being used to help treat some
gastrointestinal disorders, such as antibiotic-induced diarrhea. But
there is little data about probiotic effects on the human brain.

In a proof-of-concept study Mayer and his colleagues at U.C.L.A.
uncovered the first evidence that probiotics ingested in food can alter
human brain function. The researchers gave healthy women yogurt
twice a day for a month. Then brain scans using functional magnetic
resonance imaging were taken as the women were shown pictures of
actors with frightened or angry facial expressions. Normally, such
images trigger increased activity in emotion-processing areas of the
brain that leap into action when someone is in a state of heightened
alert. Anxious people may be uniquely sensitive to these visceral
reactions. But the women on the yogurt diet exhibited a less “reflexive”
response, “which shows that bacteria in our intestines really do affect
how we interpret the world,” says gastroenterologist Kirsten Tillisch,
the study’s principal investigator. Mayer cautions that the results are
rudimentary. “We simply don’t know yet if probiotics will help with
human anxiety,” he says. “But our research is moving in that direction.”

Strains of Bifidobacterium, which is common in the gut flora of
many mammals, including humans, have generated the best results so
far. Cryan recently published a study in which two varieties of
Bifidobacterium produced by his lab were more effective than escitalopram
(Lexapro) at treating anxious and depressive behavior in a lab mouse
strain known for pathological anxiety. Although Cryan is optimistic
that such findings may point the way to the development of
psychobiotics, he is wary of hype. “We still need a lot more research into
the mechanisms by which gut bacteria interact with the brain,” he says.

Charles Schmidt is a recipient of the National Association of
Science Writers’ Science in Society Journalism Award. His work
has appeared in Science, Nature Biotechnology, Nature Medicine
and the Washington Post.


We tend to forget that modern humanity is largely sheltered from the
last vestiges of wild untamed Earth and that our way of life bears
little resemblance to how our ancestors lived during 90 percent of
human history. We have lost nearly all trace of these former selves
and, worse, have marginalized the few remaining humans who retain
their hunter-gatherer identity. In Tanzania, tribes of wandering
foragers called the Hadza, who have lived for thousands of years in the
East African Rift Valley ecosystem, tell us an immense and precious
story about how humans, together with their microbial evolutionary
partners, are adapted to live and thrive in a complex natural
environment.

Ongoing research with the Hadza to characterize the
hunter-gatherer-microbiome relationship has yielded not only insight
into the co-adaptive capacity of this microbial ecosystem but also a
profound appreciation for how versatile human life can be. The
microbiome is central to our biology. It mediates the interaction and
exchange of information across host-environment thresholds such as the
mouth, skin and gut. The strength and importance of this mediation are
borne out in the Hadza gut microbiota. Their microbiome harbors
incredibly high taxonomic diversity, indicating great ecosystem
stability and flexibility. It is capable of withstanding the perpetual
presence of parasites and pathogens and can respond to fluctuations in
diet caused by an unpredictable and seasonally dependent food supply.
Interestingly, bacterial taxonomic abundance is different in Hadza men
and women. Because of the sexual division of labor in Hadza society,
men and women tend to consume more of their respective foraged food
resources. The women primarily collect and eat tubers and other plant
foods. As a result, it appears that women carry more bacteria to help
process the plant fiber in their diets. This difference has direct
implications for how the gut microbiota may enable Hadza women to
obtain adequate nutrition for fertility and reproductive success,
despite a resource-limited environment. Through our work with the
Hadza, we have been able to contribute to mounting evidence that human
microbiota exerts a powerful influence on host health and survival,
especially in natural fertility- and subsistence-based populations.

Comparative analysis of the gut microbiota of hunter-gatherers with
that of Westernized industrial populations is also beginning to yield
important insights. The microbial diversity in industrial groups is far
below that of the Hadza, as well as those of other rural farming
communities in Burkina Faso, Malawi and South Africa. Whereas a
reduction in diversity may not seem ideal, it is the predictable
response of an ecosystem facing a narrow range of selective pressures
and is therefore no less adaptive. Some technological interventions,
such as hypersanitation, consumption of refined foods and habitual use
of antibiotics, have had a dramatic impact over time on the functional
role of the microbiome in industrial populations. These aspects of a
Westernized way of life have to a large extent displaced much of the
original mutualistic functions of the microbiome in stabilizing our
bodies against foreign microorganisms, allowing us to digest
unprocessed foods and helping train our immune system to effectively
fight disease.

We are just beginning to understand how the microbiome evolves over
our lifetimes as a dynamic and mutualistic ecosystem that helps to
facilitate human health. Thanks to the Hadza, we know that ancient
human hunter-gatherers must have maintained a direct and persistent
interface with the natural environment. As a result, the ancestral
human microbiome was almost certainly a taxonomically diverse
community, providing the functional flexibility that accompanied
global colonization and is our adaptive legacy.

Stephanie L. Schnorr is a Ph.D. candidate in Leipzig, Germany.

A survey of fecal microbiota of 43 subjects revealed a more varied mix
of gut bacteria phyla among Hadza hunter-gatherers compared with urban
Italians.


26 FEBRUARY 2015 | VOL 518 | NATURE | S15


© 2015 Macmillan Publishers Limited. All rights reserved 


